Abstract

Subspace clustering refers to the problem of clustering unlabeled high-dimensional data 
points into a union of low-dimensional linear subspaces, whose number, orientations, and dimensions
are all unknown. In practice one may have access to dimensionality-reduced observations
of the data only, resulting, e.g., from undersampling due to complexity and speed constraints 
on the acquisition device or mechanism. More pertinently, even if the high-dimensional data 
set is available it is often desirable to first project the data points into a lower-dimensional 
space and to perform clustering there; this reduces storage requirements and computational 
cost. The purpose of this paper is to quantify the impact of dimensionality reduction through 
random projection on the performance of three subspace clustering algorithms, all of which are 
based on principles from sparse signal recovery. Specifically, we analyze the thresholding-based
subspace clustering (TSC) algorithm, the sparse subspace clustering (SSC) algorithm, and an 
orthogonal matching pursuit variant thereof (SSC-OMP). We find, for all three algorithms, that 
dimensionality reduction down to the order of the subspace dimensions is possible without incurring
significant performance degradation. Moreover, these results are order-wise optimal in
the sense that reducing the dimensionality further leads to a fundamentally ill-posed clustering 
problem. Our findings carry over to the noisy case as illustrated through analytical results for 
TSC and simulations for SSC and SSC-OMP. Extensive experiments on synthetic and real data 
complement our theoretical findings. 


1 Introduction 

One of the major challenges in modern data analysis is to find low-dimensional structure in large 
high-dimensional data sets. A prevalent low-dimensional structure is that of data points lying in 
a union of (low-dimensional) subspaces. The problem of extracting such a structure from a given 
data set can be formalized as follows. Consider the (high-dimensional) set $\mathcal{Y}$ of points in $\mathbb{R}^m$ and
assume that $\mathcal{Y} = \mathcal{Y}_1 \cup \ldots \cup \mathcal{Y}_L$, where the points in $\mathcal{Y}_\ell$ lie in a linear subspace $\mathcal{S}_\ell$ of $\mathbb{R}^m$. The
association of the data points to the sets $\mathcal{Y}_\ell$, the orientations, dimensions, and the number of the
subspaces $\mathcal{S}_\ell$ are all unknown. The problem of identifying the assignments of the points in $\mathcal{Y}$ to
the $\mathcal{Y}_\ell$ is referred to as subspace clustering [34] or hybrid linear modeling and has applications,
inter alia, in unsupervised learning, image representation and segmentation, computer vision, and 
disease detection. 

In practice one may have access to dimensionality-reduced observations of $\mathcal{Y}$ only, resulting,
e.g., from "undersampling" due to complexity and speed constraints on the acquisition device
or mechanism. More pertinently, even if the data points in $\mathcal{Y}$ are directly accessible, it is often
desirable to work on a dimensionality-reduced version of $\mathcal{Y}$ as this reduces data storage cost and
leads to computational complexity savings. The idea of reducing computational complexity through
dimensionality reduction appears, e.g., in [32] in a general context, and for subspace clustering in
the experiments reported in [38], [9]. Dimensionality reduction also has a privacy-enhancing effect
in the sense that no access to the original data is needed for processing [25].

Dimensionality reduction will, in general, come at the cost of clustering performance. The 
purpose of this paper is to analytically characterize this performance degradation for three subspace 
clustering algorithms, namely thresholding-based subspace clustering (TSC) [16], sparse subspace 
clustering (SSC) [8], [9], and SSC-orthogonal matching pursuit (SSC-OMP) [7]. The common theme 
underlying these three algorithms is that they apply spectral clustering to an adjacency matrix 
constructed from sparse representations of the data points, obtained through a nearest neighbor 
search in the case of TSC, through $\ell_1$-minimization for SSC, and through OMP in the case of SSC-OMP.
While there are numerous further approaches to subspace clustering (see [33] for an overview),
we chose to study TSC, SSC, and SSC-OMP, as they belong to the small group of subspace clustering
algorithms that are computationally tractable and succeed provably under nonrestrictive conditions
[28], [29], [9]. Specifically, the results in [16] for TSC, and in [28], [29] for SSC show that TSC
and SSC can succeed even when the subspaces $\mathcal{S}_\ell$ intersect. The corresponding proof techniques,
together with analytical performance guarantees for SSC-OMP developed in this paper, form the 
basis for our analytical characterization of the impact of dimensionality reduction on subspace 
clustering performance. 

Formal problem statement and contributions. Consider a set of $N$ data points $\mathcal{Y} \subset \mathbb{R}^m$,
and assume that $\mathcal{Y} = \mathcal{Y}_1 \cup \ldots \cup \mathcal{Y}_L$, where the points $\mathbf{y}_j^{(\ell)} \in \mathcal{Y}_\ell$, $\ell \in \{1, \ldots, L\}$, lie in a
$d_\ell$-dimensional linear subspace of $\mathbb{R}^m$, denoted by $\mathcal{S}_\ell$. Neither the assignments of the points in $\mathcal{Y}$ to
the sets $\mathcal{Y}_\ell$ nor the subspaces $\mathcal{S}_\ell$ nor the number of subspaces $L$ are known. Traditional subspace
clustering operates on the data $\mathcal{Y}$ with the goal of segmenting it into the sets $\mathcal{Y}_\ell$. Here, we assume,
however, that clustering is performed on a dimensionality-reduced version of the points in $\mathcal{Y}$.
Specifically, we employ the random projection method [32] by first applying the (same) realization
of a random projection matrix $\Phi \in \mathbb{R}^{p \times m}$ (typically $p \ll m$) to each point in $\mathcal{Y}$ to obtain the set of
dimensionality-reduced data points $\mathcal{X}$. Then, we declare the segmentation obtained by operating
on $\mathcal{X}$ to be the segmentation of the data points in $\mathcal{Y}$. The realization of $\Phi$ does not need to
be known. Two error sources determine the performance of this approach: first,
the error that would be incurred even if clustering were performed on the high-dimensional data
set $\mathcal{Y}$ directly; second, and more pertinently, the error incurred by operating on dimensionality-reduced
data. The former is quantified for TSC in [16] and for SSC in [28], [29]; for SSC-OMP,
this paper develops corresponding new results. Analytically characterizing the error incurred by
dimensionality reduction is the main contribution of this paper.

While it is conceivable that TSC, which is based on thresholding inner products, exhibits graceful
performance degradation as the data set's dimensionality is reduced through random projection,
this is far from obvious for the $\ell_1$-minimization based SSC algorithm and the iterative SSC-OMP
algorithm. We prove our main results by first deriving conditions for TSC, SSC, and SSC-OMP to
ensure correct clustering of dimensionality-reduced data. While these conditions are general, they
only become amenable to insightful interpretations once particularized for a random data model,
also used in [28], [16], that takes the subspace structure of the data set into account. The resulting
clustering conditions make the impact of dimensionality reduction explicit and reveal a tradeoff
between the affinity of the subspaces $\mathcal{S}_\ell$ and the amount of dimensionality reduction possible.
Specifically, we find that all three algorithms succeed provably under quite generous conditions on
the relative orientations of the subspaces $\mathcal{S}_\ell$, provided that the dimensionality is reduced no more
than down to the largest subspace dimension $d_{\max} = \max_\ell d_\ell$. As the computational complexity
associated with the construction of the adjacency matrix is essentially linear in the dimension of the
ambient space, $m$, for all three algorithms, random projection reduces the complexity of this step by
a factor of $m/d_{\max}$. These complexity savings translate into possibly significant run-time savings
for the overall clustering algorithms (which include the spectral clustering step), in particular when
$m$ is sufficiently large relative to $N$.

We also study the impact of noise, added to the high-dimensional data points, on clustering performance.
For TSC, we derive a clustering condition which quantifies the tradeoff between the affinity
of the subspaces $\mathcal{S}_\ell$ and the amount of dimensionality reduction possible, as a function of noise
variance. Specifically, this condition allows us to conclude that TSC succeeds provably provided
that, as in the noiseless case, the dimensionality is reduced no further than down to the largest
subspace dimension $d_{\max}$, and the noise variance is sufficiently small. An approach akin to that used
for TSC can be applied to establish a similar clustering condition for SSC-OMP. The corresponding
technical details are, however, significantly more involved and cumbersome. We therefore decided
not to state the formal result. Regarding SSC, we remark that Wang et al. [36] reported deterministic
clustering conditions for the Lasso-version of SSC [29] applied to dimensionality-reduced noisy
data. However, the corresponding results [36, Lem. 16, Thm. 18] make the critical assumption of
the signal part of the projected noisy data being normalized, whereas the noise component remains
un-normalized. It is difficult to see how one would realize this in practice, unless the noise realization
is known perfectly, in which case the noise component could be removed, which would take
us back to the noiseless case. The results in [36] for noisy data therefore appear to be of limited
practicality. While the statements in [36] may be particularized to the noiseless case, we note that
corresponding results appeared in the conference version of this paper before the publication
of [36].

We note that our results, both for the noiseless and the noisy case, apply even when the
subspaces $\mathcal{S}_\ell$ span the ambient space $\mathbb{R}^m$. This follows from our clustering conditions depending on
the pairwise affinities between subspaces only, and pairwise affinities changing only moderately if
the dimensionality is reduced down to no more than the order of the individual subspace dimensions.

Another popular dimensionality reduction method is principal component analysis (PCA). However,
when used in the context of subspace clustering, PCA allows dimensionality reduction down
to the dimension of the overall span of the subspaces only, in general; this results in no dimensionality
reduction at all when the subspaces $\mathcal{S}_\ell$ span the ambient space. To see this, consider the
$L$ subspaces of dimension 1 that correspond to the standard basis in $\mathbb{R}^m$, i.e., the $\ell$-th subspace
is spanned by the vector $\mathbf{e}_\ell$ given by $[\mathbf{e}_\ell]_\ell = 1$ and $[\mathbf{e}_\ell]_i = 0$, for $i \neq \ell$. Assuming that each of
the data points in the data set under consideration, denoted by $\mathbf{Y} \in \mathbb{R}^{m \times N}$, lies in one of these
$L$ subspaces, the corresponding sample covariance matrix $\mathbf{Y}\mathbf{Y}^T$ has non-zero entries only in its
first $L$ main diagonal entries. The first $L$ principal components are therefore given by the vectors
$\mathbf{e}_\ell$. Reducing the dimensionality of the data set to below $L$ will result in certain data points
being mapped to zero (owing to the orthogonality of the $\mathbf{e}_\ell$). Moreover, PCA has computational
complexity $O(Nm^2 + m^3)$ while random projection through Gaussian matrices and fast random
projection matrices [1] has complexity $O(pmN)$ and $O(\log(m)\, mN)$, respectively, and is therefore
computationally much less demanding. This is an important aspect as computational complexity
is a major motivation for dimensionality reduction.
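
The standard-basis example above can be checked numerically. The following is a minimal NumPy sketch, not part of the original analysis; the sizes `m`, `L`, `p` and the unequal point counts (chosen so that the eigenvalues of $\mathbf{Y}\mathbf{Y}^T$ are distinct) are illustrative assumptions.

```python
import numpy as np

m, L = 6, 4                 # ambient dimension and number of 1-D subspaces (assumed values)
counts = [5, 4, 3, 2]       # points per subspace; unequal so that eigenvalues are distinct

# Columns of Y are unit-norm points; subspace l is spanned by the l-th standard basis vector.
cols, labels = [], []
for l, n_l in enumerate(counts):
    e_l = np.zeros(m)
    e_l[l] = 1.0
    for j in range(n_l):
        cols.append((-1.0) ** j * e_l)   # alternate the sign; the point stays in the subspace
        labels.append(l)
Y = np.column_stack(cols)
labels = np.array(labels)

# Sample covariance (up to scaling): diagonal, supported on the first L diagonal entries.
C = Y @ Y.T

# PCA down to p < L dimensions: keep the p leading eigenvectors of C.
p = 3
eigvals, eigvecs = np.linalg.eigh(C)               # eigenvalues in ascending order
W = eigvecs[:, np.argsort(eigvals)[::-1][:p]]      # m x p matrix of principal directions
X = W.T @ Y                                        # p x N dimensionality-reduced data

norms = np.linalg.norm(X, axis=0)
lost = np.unique(labels[norms < 1e-12])            # subspaces whose points are mapped to zero
print("diag(Y Y^T):", np.diag(C))
print("subspaces mapped to zero:", lost)
```

With these choices, all points of the subspace with the fewest points are mapped to zero once `p` drops below `L`.
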

Notation. We use lowercase boldface letters to denote (column) vectors and uppercase boldface
letters to designate matrices. The superscript $T$ stands for transposition. For the vector $\mathbf{x}$, $x_q$
denotes its $q$th entry and $\mathbf{x}_{\mathcal{S}}$ is the subvector of $\mathbf{x}$ with entries corresponding to the indices in
the set $\mathcal{S}$. For the matrix $\mathbf{A}$, $A_{ij}$ designates the entry in its $i$th row and $j$th column, $\mathbf{A}_{\mathcal{S}}$ the
matrix containing the columns of $\mathbf{A}$ with indices in the set $\mathcal{S}$, $\|\mathbf{A}\|_{2 \to 2} := \max_{\|\mathbf{v}\|_2 = 1} \|\mathbf{A}\mathbf{v}\|_2$ its
spectral norm, $\sigma_{\min}(\mathbf{A})$ its minimum singular value, and $\|\mathbf{A}\|_F := (\sum_{i,j} |A_{ij}|^2)^{1/2}$ its Frobenius
norm. If $\mathbf{A}$ has full column rank, $\mathbf{A}^\dagger := (\mathbf{A}^T \mathbf{A})^{-1} \mathbf{A}^T$ stands for its (left) pseudoinverse, and for
$\mathbf{A}$ with full row rank, $\mathbf{A}^\dagger := \mathbf{A}^T (\mathbf{A}\mathbf{A}^T)^{-1}$ is the (right) pseudoinverse. The identity matrix is
denoted by $\mathbf{I}$. $\log(\cdot)$ refers to the natural logarithm, $\arccos(\cdot)$ is the inverse function of $\cos(\cdot)$, and
$x \wedge y$ denotes the minimum of $x$ and $y$. The set $\{1, \ldots, N\}$ is written as $[N]$. The cardinality
of the set $\mathcal{S}$ is designated by $|\mathcal{S}|$ and its complement is $\overline{\mathcal{S}}$. $\mathcal{N}(\boldsymbol{\mu}, \boldsymbol{\Sigma})$ stands for the distribution
of a real Gaussian random vector with mean $\boldsymbol{\mu}$ and covariance matrix $\boldsymbol{\Sigma}$. We write $X \sim Y$ to
indicate that the random variables $X$ and $Y$ are equally distributed. For notational convenience,
we use the following shorthands: $\max_\ell$ for $\max_{\ell \in [L]}$, $\max_{k \neq \ell}$ for $\max_{k \in [L] \colon k \neq \ell}$, and $\max_{k,\ell \colon k \neq \ell}$ for
$\max_{k,\ell \in [L] \colon k \neq \ell}$. The unit sphere in $\mathbb{R}^m$ is $\mathbb{S}^{m-1} := \{\mathbf{x} \in \mathbb{R}^m \colon \|\mathbf{x}\|_2 = 1\}$. A subgraph $H$ of a graph
$G$ is said to be connected if every pair of nodes in $H$ can be joined by a path along edges with
nodes exclusively in $H$. A subgraph $H$ of $G$ is called a connected component of $G$ if $H$ is connected
and if there are no edges between nodes in $H$ and the remaining nodes in $G$.

2 A brief review of TSC, SSC, and SSC-OMP 

We next briefly summarize the TSC [16], SSC [8], [9], and SSC-OMP [7] algorithms. All three
algorithms apply normalized spectral clustering [35] to an adjacency matrix $\mathbf{A}$ built by finding a
sparse representation of each data point in terms of the other data points. Specifically, TSC is based
on least-squares representations in terms of nearest neighbors while SSC and SSC-OMP construct
$\mathbf{A}$ by finding sparse representations via $\ell_1$-minimization and OMP, respectively. Note that the focus
in [16] is on a version of TSC that uses a spherical distance measure between data points instead of
least-squares regression coefficients to determine the entries of $\mathbf{A}$. The analytical results presented
here apply to both versions of TSC. We decided, however, to work with the least-squares version
as this formulation better elucidates the sparsity aspect and thereby the relationship to SSC and
SSC-OMP.

In order to emphasize that we consider all three algorithms applied to dimensionality-reduced
data, their descriptions will be in terms of the dimensionality-reduced data set $\mathcal{X} \subset \mathbb{R}^p$. We
furthermore assume that an estimate $\hat{L}$ of the number of subspaces $L$ is available. The estimation
of $L$ from $\mathcal{X}$ is discussed later. We also note that the formulations of the TSC and SSC-OMP
algorithms below assume that the data points in $\mathcal{X}$ are of comparable $\ell_2$-norm. This assumption is
relevant for Step 1 in both cases and is not restrictive as the data points can be normalized prior
to clustering.

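The TSC algorithm: Given a set of $N$ data points $\mathcal{X}$ in $\mathbb{R}^p$, an estimate of the number of
subspaces $\hat{L}$, and the parameter $q$, perform the following steps:

Step 1: For every $\mathbf{x}_j \in \mathcal{X}$, find the set $\mathcal{S}_j \subset [N] \setminus \{j\}$ of cardinality $q$ defined by

$$|\langle \mathbf{x}_j, \mathbf{x}_i \rangle| \geq |\langle \mathbf{x}_j, \mathbf{x}_k \rangle|, \quad \text{for all } i \in \mathcal{S}_j \text{ and all } k \notin \mathcal{S}_j,$$

and let $\mathbf{z}_j$ be the coefficient vector corresponding to the minimum least-squares representation
of $\mathbf{x}_j$ in terms of $\mathbf{x}_i$, $i \in \mathcal{S}_j$. Specifically, set $(\mathbf{z}_j)_{\mathcal{S}_j} = \arg\min_{\mathbf{z}} \|\mathbf{x}_j - \mathbf{X}_{\mathcal{S}_j} \mathbf{z}\|_2$ (if multiple solutions
exist, choose, e.g., the $\mathbf{z}$ with minimum $\ell_2$-norm), and $(\mathbf{z}_j)_{\overline{\mathcal{S}_j}} = \mathbf{0}$. Construct the adjacency matrix
$\mathbf{A}$ according to $\mathbf{A} = \mathbf{Z} + \mathbf{Z}^T$, where $\mathbf{Z} = \mathrm{abs}([\mathbf{z}_1 \ldots \mathbf{z}_N])$ and $\mathrm{abs}(\cdot)$ takes absolute values
element-wise.

Step 2: Apply normalized spectral clustering [26], [35] to $(\mathbf{A}, \hat{L})$.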
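
Step 1 of TSC can be sketched in a few lines of NumPy. This is an illustrative sketch only, not the authors' implementation: the function name `tsc_adjacency`, the toy data (two orthogonal two-dimensional subspaces of $\mathbb{R}^6$), and the choice `q = 3` are assumptions, and the spectral clustering step (Step 2) is omitted.

```python
import numpy as np

def tsc_adjacency(X, q):
    """Sketch of TSC Step 1 (least-squares version). X is p x N, columns of comparable norm."""
    p, N = X.shape
    G = np.abs(X.T @ X)              # absolute inner products |<x_j, x_i>|
    np.fill_diagonal(G, -np.inf)     # exclude i = j from the neighbor search
    Z = np.zeros((N, N))
    for j in range(N):
        S_j = np.argsort(G[j])[::-1][:q]    # indices of the q largest |<x_j, x_i>|
        # lstsq returns the minimum-norm solution of min_z ||x_j - X_{S_j} z||_2
        z, *_ = np.linalg.lstsq(X[:, S_j], X[:, j], rcond=None)
        Z[S_j, j] = np.abs(z)
    return Z + Z.T

# Toy data (assumed): 10 unit-norm points on each of two orthogonal 2-D subspaces of R^6.
rng = np.random.default_rng(0)

def sample(basis, n):
    a = rng.standard_normal((basis.shape[1], n))
    a /= np.linalg.norm(a, axis=0)
    return basis @ a

I6 = np.eye(6)
X = np.hstack([sample(I6[:, :2], 10), sample(I6[:, 2:4], 10)])
A = tsc_adjacency(X, q=3)
false_connections = np.count_nonzero(A[:10, 10:])
print("false connections:", false_connections)
```

Because the two toy subspaces are orthogonal, all cross inner products vanish and the nearest-neighbor search only selects points from the correct subspace.
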

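The SSC algorithm: Given a set of $N$ data points $\mathcal{X}$ in $\mathbb{R}^p$ and an estimate of the number
of subspaces $\hat{L}$, perform the following steps:

Step 1: Let $\mathbf{X} \in \mathbb{R}^{p \times N}$ be the matrix whose columns are the points in $\mathcal{X}$. For every $\mathbf{x}_j \in \mathcal{X}$
determine $\mathbf{z}_j$ as a solution of

$$\underset{\mathbf{z}}{\text{minimize}} \; \|\mathbf{z}\|_1 \quad \text{subject to} \quad \mathbf{x}_j = \mathbf{X}\mathbf{z} \ \text{ and } \ z_j = 0. \qquad (1)$$

Construct the adjacency matrix $\mathbf{A}$ according to $\mathbf{A} = \mathbf{Z} + \mathbf{Z}^T$, where $\mathbf{Z} = \mathrm{abs}([\mathbf{z}_1 \ldots \mathbf{z}_N])$.

Step 2: Apply normalized spectral clustering [26], [35] to $(\mathbf{A}, \hat{L})$.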

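The SSC-OMP algorithm: Given a set of $N$ data points $\mathcal{X}$ in $\mathbb{R}^p$, an estimate of the number
of subspaces $\hat{L}$, and a maximum number of OMP iterations $s_{\max}$, perform the following steps:

Step 1: For every $\mathbf{x}_j \in \mathcal{X}$, find a sparse representation of $\mathbf{x}_j$ in terms of $\mathcal{X} \setminus \{\mathbf{x}_j\}$ using OMP as
follows: Initialize the iteration counter $s = 0$, the residual $\mathbf{r}_0 = \mathbf{x}_j$, and the set of selected indices
$\Lambda_0 = \emptyset$. For $s = 1, 2, \ldots$ perform updates according to

$$\Lambda_s = \Lambda_{s-1} \cup \arg\max_{i \in [N] \colon i \neq j} |\langle \mathbf{x}_i, \mathbf{r}_{s-1} \rangle| \qquad (2)$$

$$\mathbf{r}_s = (\mathbf{I} - \mathbf{X}_{\Lambda_s} \mathbf{X}_{\Lambda_s}^\dagger)\, \mathbf{x}_j \qquad (3)$$

until $\mathbf{r}_s = \mathbf{0}$ or $s = s_{\max}$ (when the maximizer in (2) is not unique, select any of the solutions).
With the number of OMP iterations actually performed denoted by $s_{\mathrm{end}}$, set $(\mathbf{z}_j)_{\Lambda_{s_{\mathrm{end}}}} = \mathbf{X}_{\Lambda_{s_{\mathrm{end}}}}^\dagger \mathbf{x}_j$,
$(\mathbf{z}_j)_{\overline{\Lambda_{s_{\mathrm{end}}}}} = \mathbf{0}$, and construct the adjacency matrix $\mathbf{A}$ according to $\mathbf{A} = \mathbf{Z} + \mathbf{Z}^T$, where $\mathbf{Z} =
\mathrm{abs}([\mathbf{z}_1 \ldots \mathbf{z}_N])$.

Step 2: Apply normalized spectral clustering [26], [35] to $(\mathbf{A}, \hat{L})$.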
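
The OMP updates (2) and (3) translate almost directly into code. The sketch below is illustrative and makes several assumptions that are not in the text: the function name `ssc_omp_adjacency`, a numerical tolerance `tol` standing in for the exact stopping condition $\mathbf{r}_s = \mathbf{0}$, a safeguard against reselecting an index, and the toy data set. Spectral clustering (Step 2) is again omitted.

```python
import numpy as np

def ssc_omp_adjacency(X, s_max, tol=1e-10):
    """Sketch of SSC-OMP Step 1. X is p x N with columns of comparable norm."""
    p, N = X.shape
    Z = np.zeros((N, N))
    for j in range(N):
        x_j = X[:, j]
        r = x_j.copy()                 # residual r_0 = x_j
        Lam = []                       # selected indices, Lambda_0 = empty set
        coef = np.zeros(0)
        for s in range(s_max):
            if np.linalg.norm(r) <= tol:        # r_s = 0 up to numerical precision
                break
            corr = np.abs(X.T @ r)
            corr[j] = -np.inf                   # enforce i != j
            if Lam:
                corr[Lam] = -np.inf             # safeguard: never reselect an index
            Lam.append(int(np.argmax(corr)))    # update (2)
            coef, *_ = np.linalg.lstsq(X[:, Lam], x_j, rcond=None)
            r = x_j - X[:, Lam] @ coef          # update (3): project x_j off span(X_Lam)
        if Lam:
            Z[Lam, j] = np.abs(coef)
    return Z + Z.T

# Toy data (assumed): 10 unit-norm points on each of two orthogonal 2-D subspaces of R^6.
rng = np.random.default_rng(0)

def sample(basis, n):
    a = rng.standard_normal((basis.shape[1], n))
    a /= np.linalg.norm(a, axis=0)
    return basis @ a

I6 = np.eye(6)
X = np.hstack([sample(I6[:, :2], 10), sample(I6[:, 2:4], 10)])
A = ssc_omp_adjacency(X, s_max=5)
false_connections = np.count_nonzero(A[:10, 10:])
print("false connections:", false_connections)
```

In this toy setting OMP stops after two iterations per point, since two points from the same two-dimensional subspace suffice to represent a third.
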

For all three algorithms the number of subspaces $L$ can be estimated based on the insight that
the number of zero eigenvalues of the normalized Laplacian of the graph $G$ with adjacency matrix
$\mathbf{A}$, henceforth simply referred to as "the graph $G$", is equal to the number of connected components
of $G$ [30]. A robust estimator for $L$ is the eigengap heuristic described in [35].
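
As a rough illustration of this estimation step, the sketch below counts the (numerically) zero eigenvalues of the symmetric normalized Laplacian and applies a simplified eigengap rule (largest gap among the smallest eigenvalues). The function names, the cap `L_max`, the tolerance, and the toy graph are assumptions for illustration; the heuristic in [35] may differ in its details.

```python
import numpy as np

def normalized_laplacian_eigs(A):
    """Eigenvalues (ascending) of L_sym = I - D^{-1/2} A D^{-1/2}."""
    deg = A.sum(axis=1)
    d_inv_sqrt = np.zeros_like(deg)
    d_inv_sqrt[deg > 0] = 1.0 / np.sqrt(deg[deg > 0])
    L_sym = np.eye(len(A)) - (d_inv_sqrt[:, None] * A) * d_inv_sqrt[None, :]
    return np.linalg.eigvalsh(L_sym)

def estimate_L(A, L_max):
    """Simplified eigengap rule: position of the largest gap among the L_max + 1 smallest eigenvalues."""
    lam = normalized_laplacian_eigs(A)[: L_max + 1]
    gaps = np.diff(lam)                 # gaps[i] = lam[i + 1] - lam[i]
    return int(np.argmax(gaps)) + 1

# Toy graph (assumed): three disjoint complete graphs on 4, 5, and 6 nodes.
sizes = [4, 5, 6]
A = np.zeros((sum(sizes), sum(sizes)))
start = 0
for n in sizes:
    A[start:start + n, start:start + n] = np.ones((n, n)) - np.eye(n)
    start += n

num_zero = int(np.sum(normalized_laplacian_eigs(A) < 1e-8))
L_hat = estimate_L(A, L_max=6)
print("zero eigenvalues:", num_zero, "eigengap estimate:", L_hat)
```

For a graph with cleanly separated connected components, both quantities coincide with the number of components.
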

Let the oracle segmentation of $\mathcal{X}$ be given by $\mathcal{X} = \mathcal{X}_1 \cup \ldots \cup \mathcal{X}_L$. If each connected component in
the graph $G$ corresponds exclusively to points from one of the sets $\mathcal{X}_\ell$, spectral clustering will deliver
the oracle segmentation [35, Prop. 4] and the clustering error, i.e., the fraction of misclustered
points, will be zero. Since conditions guaranteeing zero clustering error are inherently hard to
obtain, we will work with an intermediate, albeit sensible, performance measure, also employed in
earlier work. Specifically, this measure, termed the no-false connections property, declares success
if the graph $G$ has no false connections, i.e., if each $\mathbf{x}_j \in \mathcal{X}_\ell$ is connected to points in $\mathcal{X}_\ell$ only, for all
$\ell$. Guaranteeing the absence of false connections does not, however, guarantee that the connected
components of $G$ correspond to the $\mathcal{X}_\ell$, as the points in a given set $\mathcal{X}_\ell$ may form two (or more)
distinct connected components in $G$.

To counter this problem, sufficiently many entries in each row/column of the adjacency matrix $\mathbf{A}$
have to be non-zero. Specifically, for the subgraphs of $G$ corresponding to the $\mathcal{X}_\ell$ to be connected,
each row/column of $\mathbf{A}$ corresponding to a point in $\mathcal{X}_\ell$ needs to have between $O(\log n_\ell)$ and $O(n_\ell)$
non-zero entries. As the solutions $\mathbf{z}$ to $\arg\min_{\mathbf{z}} \|\mathbf{x}_j - \mathbf{X}_{\mathcal{S}_j} \mathbf{z}\|_2$ are typically dense, TSC is likely to
select a representation of $\mathbf{x}_j$ in terms of points in $\mathcal{X}_\ell \setminus \{\mathbf{x}_j\}$ with on the order of $q$ non-zero coefficients.
Choosing $q$ large enough therefore ensures sufficient connectivity of the graph $G$ generated by
TSC. On the other hand, taking $q$ to be large increases the probability of false connections. The
performance guarantee we obtain for TSC therefore requires $q$ to be sufficiently small relative to
the $n_\ell$.

For SSC and SSC-OMP, the number of non-zero entries in each row/column of $\mathbf{A}$ turns out to be
tied to $d_\ell$, rather than $n_\ell$. To see this, suppose that both algorithms exclusively select data points
from $\mathcal{X}_\ell \setminus \{\mathbf{x}_j\}$ to represent $\mathbf{x}_j$. Moreover, assume that the $\mathcal{X}_\ell$ are non-degenerate in the sense that,
indeed, $d_\ell$ points are needed to represent $\mathbf{x}_j \in \mathcal{X}_\ell$ through points in $\mathcal{X}_\ell \setminus \{\mathbf{x}_j\}$; this precludes, e.g.,
that $\mathcal{X}_\ell$ contains multiple copies of the same data point. The OMP algorithm in SSC-OMP then
terminates after $\min(d_\ell, s_{\max})$ (recall that $d_\ell = \dim(\mathcal{S}_\ell)$) iterations for $\mathbf{x}_j \in \mathcal{X}_\ell$ and hence results in
exactly $\min(d_\ell, s_{\max})$ non-zero entries in the corresponding column of $\mathbf{Z}$ (recall that $\mathbf{A} = \mathbf{Z} + \mathbf{Z}^T$).
For SSC, we simply note that $d_\ell$ points are enough to represent $\mathbf{x}_j \in \mathcal{X}_\ell$ through other points in
$\mathcal{X}_\ell$ and we cannot guarantee more than $d_\ell$ non-zero entries in the corresponding column of $\mathbf{Z}$, in
general. This will lead to insufficient connectivity for SSC and SSC-OMP when $d_\ell$ is not in the
range $O(\log n_\ell)$ to $O(n_\ell)$. The problem is exacerbated when the data set is degenerate. To counter
insufficient connectivity in SSC, a modification which adds an $\ell_2$-penalty to the cost function in
(1) was proposed in [9, Sec. 5]. Such a modification is not known for SSC-OMP, and this may be
considered a limitation of SSC-OMP.

We finally remark that TSC and SSC-OMP can be made essentially parameterless, like SSC.
Specifically, a procedure for choosing the TSC parameter $q$ in a data-driven fashion has been
described in the literature, and for SSC-OMP we can get rid of the parameter $s_{\max}$ by stopping the
OMP step once the $\ell_2$-norm of the residual $\mathbf{r}_s$ falls below a threshold value.


3 Main results 


We start by specifying the statistical data model used throughout the paper. The subspaces $\mathcal{S}_\ell$
are taken to be deterministic and the points within the $\mathcal{S}_\ell$ are chosen randomly. Specifically, the
elements of the set $\mathcal{Y}_\ell$ in $\mathcal{Y} = \mathcal{Y}_1 \cup \ldots \cup \mathcal{Y}_L$ are obtained by choosing $n_\ell$ points at random according
to $\mathbf{y}_j^{(\ell)} = \mathbf{U}^{(\ell)} \mathbf{a}_j^{(\ell)}$, $j \in [n_\ell]$, where the columns of $\mathbf{U}^{(\ell)} \in \mathbb{R}^{m \times d_\ell}$ form an orthonormal basis for the
$d_\ell$-dimensional subspace $\mathcal{S}_\ell$, and the $\mathbf{a}_j^{(\ell)}$ are i.i.d. uniform on $\mathbb{S}^{d_\ell - 1}$. As the $\mathbf{U}^{(\ell)}$ are orthonormal,
the data points $\mathbf{y}_j^{(\ell)}$ are distributed uniformly on the set $\{\mathbf{y} \in \mathcal{S}_\ell \colon \|\mathbf{y}\|_2 = 1\} = \mathcal{S}_\ell \cap \mathbb{S}^{m-1}$, which
avoids degenerate situations where the data points lie in preferred directions. To see why such
degeneracies can lead to ambiguous results, consider a two-dimensional subspace and assume that
the data points in this subspace are skewed towards two distinct directions. Then, there are two
sensible segmentations. One is to assign the points corresponding to each direction to separate
clusters, the other to assign all points to one cluster.
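
The data model can be simulated with a short NumPy sketch. It relies on the standard fact that a normalized standard Gaussian vector is uniformly distributed on the unit sphere. The helper names, the dimensions, and the use of randomly drawn bases are assumptions made for illustration only; in the model above the subspaces themselves are deterministic.

```python
import numpy as np

def random_orthonormal_basis(m, d, rng):
    """Return an m x d matrix with orthonormal columns (a stand-in for U^(l))."""
    Q, _ = np.linalg.qr(rng.standard_normal((m, d)))
    return Q

def sample_on_subspace(U, n, rng):
    """Draw n points y = U a, with a uniform on the unit sphere of R^d."""
    a = rng.standard_normal((U.shape[1], n))
    a /= np.linalg.norm(a, axis=0, keepdims=True)   # normalized Gaussians are uniform on the sphere
    return U @ a

rng = np.random.default_rng(0)
m, dims, counts = 30, [3, 5], [20, 40]              # assumed toy sizes
bases = [random_orthonormal_basis(m, d, rng) for d in dims]
Y = np.hstack([sample_on_subspace(U, n, rng) for U, n in zip(bases, counts)])
labels = np.repeat(np.arange(len(dims)), counts)    # oracle segmentation

# Sanity checks: unit norm, and points of cluster 0 lie in the span of bases[0].
norms = np.linalg.norm(Y, axis=0)
Y0 = Y[:, labels == 0]
residual = np.linalg.norm(Y0 - bases[0] @ (bases[0].T @ Y0))
print(norms.min(), norms.max(), residual)
```
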

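The dimensionality-reduced data set $\mathcal{X} \subset \mathbb{R}^p$ is obtained by applying the (same) realization
of a random matrix $\Phi \in \mathbb{R}^{p \times m}$ ($p > \max_\ell d_\ell$) to each point in $\mathcal{Y}$. The elements of the sets $\mathcal{X}_\ell$
in $\mathcal{X} = \mathcal{X}_1 \cup \ldots \cup \mathcal{X}_L$ are hence given by $\mathbf{x}_j^{(\ell)} = \Phi \mathbf{y}_j^{(\ell)}$, $j \in [n_\ell]$. We take $\Phi$ as a random matrix
satisfying the following concentration inequality

$$\mathrm{P}\Big[ \big| \|\Phi \mathbf{x}\|_2^2 - \|\mathbf{x}\|_2^2 \big| \geq t \|\mathbf{x}\|_2^2 \Big] \leq 2 e^{-\tilde{c}\, t^2 p}, \quad \forall t > 0, \ \forall \mathbf{x} \in \mathbb{R}^m, \qquad (4)$$
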

where $\tilde{c}$ is either a numerical constant or a parameter mildly depending on $m$. Random matrices
satisfying (4) realize, with high probability, linear embeddings in the sense of the Johnson-Lindenstrauss
(JL) Lemma, see, e.g., [32], [10, Sec. 9.5]. The JL Lemma says that every set
of $N$ points in Euclidean space can be embedded in an $O(\epsilon^{-2} \log N)$-dimensional space without
perturbing the pairwise Euclidean distances between the points by more than a factor of $1 \pm \epsilon$.

A similar statement on random projection preserving affinities between subspaces, as defined in
(5), is used in our proofs. Specifically, we show that randomly projecting a set of $d$-dimensional subspaces
into $p$-dimensional space does not increase their pairwise affinities by more than const. $\sqrt{d/p}$,
with high probability (cf. (45)). The concentration inequality (4) holds, inter alia, for matrices with
i.i.d. subgaussian¹ entries [10, Lem. 9.8]; this includes $\mathcal{N}(0, 1/p)$ entries and entries that are uniformly
distributed on $\{-1/\sqrt{p}, 1/\sqrt{p}\}$. Such matrices may, however, be costly to generate, store,
and apply to high-dimensional data points. In order to reduce these costs, structured random matrices
satisfying (4) (with $\tilde{c}$ possibly mildly dependent on $m$) were proposed, e.g., in [1]. For example,
the structured random matrix proposed in [1] (and described in detail later in the paper) satisfies (4)
with $\tilde{c} = c_2 \log^{-4}(m)$, where $c_2$ is a numerical constant, and can be applied in time
$O(m \log m)$ as opposed to time $O(mp)$ for the realizations of general subgaussian random matrices.
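
The two unstructured examples mentioned above, i.i.d. $\mathcal{N}(0, 1/p)$ entries and entries uniform on $\{-1/\sqrt{p}, 1/\sqrt{p}\}$, are easy to try out. The following sketch applies one realization of each to a set of points and reports the largest relative deviation of the squared norms. The sizes `m`, `p`, `N`, the random test points, and the helper name `max_norm_distortion` are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, p, N = 1000, 200, 50                      # assumed toy sizes with p << m

Y = rng.standard_normal((m, N))              # arbitrary high-dimensional points (columns)

# Two random projection matrices with i.i.d. subgaussian entries of variance 1/p.
Phi_gauss = rng.standard_normal((p, m)) / np.sqrt(p)
Phi_rad = rng.choice([-1.0, 1.0], size=(p, m)) / np.sqrt(p)

def max_norm_distortion(Phi, Y):
    """Largest value of | ||Phi y||^2 / ||y||^2 - 1 | over the columns y of Y."""
    ratio = np.sum((Phi @ Y) ** 2, axis=0) / np.sum(Y ** 2, axis=0)
    return float(np.max(np.abs(ratio - 1.0)))

dist_gauss = max_norm_distortion(Phi_gauss, Y)   # the same realization is applied to all points
dist_rad = max_norm_distortion(Phi_rad, Y)
print("Gaussian:", dist_gauss, "Rademacher:", dist_rad)
```

Increasing `p` shrinks the observed distortion roughly like $1/\sqrt{p}$, in line with the form of the concentration inequality.
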

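The clustering performance guarantees we obtain below are all in terms of the affinity between
the subspaces $\mathcal{S}_k$ and $\mathcal{S}_\ell$ defined as [28, Def. 2.6], [29, Def. 1.2]

$$\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) := \frac{1}{\sqrt{d_k \wedge d_\ell}} \big\| \mathbf{U}^{(k)T} \mathbf{U}^{(\ell)} \big\|_F. \qquad (5)$$

Note that $0 \leq \mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) \leq 1$, with $\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) = 1$ if $\mathcal{S}_k \subseteq \mathcal{S}_\ell$ or $\mathcal{S}_\ell \subseteq \mathcal{S}_k$ and $\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) = 0$ if $\mathcal{S}_k$
and $\mathcal{S}_\ell$ are orthogonal to each other. Moreover, we have

$$\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) = \sqrt{\cos^2(\theta_1) + \ldots + \cos^2(\theta_{d_k \wedge d_\ell})} \Big/ \sqrt{d_k \wedge d_\ell}, \qquad (6)$$

where $\theta_1 \leq \ldots \leq \theta_{d_k \wedge d_\ell}$ are the principal angles between $\mathcal{S}_k$ and $\mathcal{S}_\ell$ [12, Sec. 6.3.4]. If $\mathcal{S}_k$ and $\mathcal{S}_\ell$
intersect in $t$ dimensions, i.e., if $\mathcal{S}_k \cap \mathcal{S}_\ell$ is $t$-dimensional, then $\cos(\theta_1) = \ldots = \cos(\theta_t) = 1$ and hence
$\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) \geq \sqrt{t/(d_k \wedge d_\ell)}$. The affinity between subspaces plays an important role in subspace
classification as well, see [19, Thms. 2 and 3].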
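
The affinity can be computed from orthonormal bases in two equivalent ways: via the Frobenius norm of the cross-Gram matrix, or via the principal angles as in (6), whose cosines are the singular values of that matrix. The sketch below shows both on small hand-picked subspaces; the function names and the test subspaces are illustrative assumptions.

```python
import numpy as np

def affinity(U_k, U_l):
    """Affinity from orthonormal bases: ||U_k^T U_l||_F / sqrt(min(d_k, d_l))."""
    d = min(U_k.shape[1], U_l.shape[1])
    return float(np.linalg.norm(U_k.T @ U_l, "fro") / np.sqrt(d))

def affinity_from_angles(U_k, U_l):
    """Same quantity via principal angles: singular values of U_k^T U_l are cos(theta_i)."""
    cosines = np.linalg.svd(U_k.T @ U_l, compute_uv=False)
    d = min(U_k.shape[1], U_l.shape[1])
    return float(np.sqrt(np.sum(cosines[:d] ** 2) / d))

I5 = np.eye(5)
S_a = I5[:, [0, 1]]        # span(e1, e2)
S_b = I5[:, [2, 3]]        # span(e3, e4): orthogonal to S_a
S_c = I5[:, [0, 1, 2]]     # span(e1, e2, e3): contains S_a
S_d = I5[:, [1, 2]]        # span(e2, e3): intersects S_a in t = 1 dimension

aff_orth = affinity(S_a, S_b)
aff_contained = affinity(S_a, S_c)
aff_intersect = affinity(S_a, S_d)
print(aff_orth, aff_contained, aff_intersect, affinity_from_angles(S_a, S_d))
```

The three cases reproduce the extremes and the intersection bound discussed above: orthogonal subspaces, nested subspaces, and subspaces sharing a one-dimensional intersection.
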

We start with our main result for TSC. 


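Theorem 1. Choose $q$ such that $q < \min_\ell n_\ell / 6$. If

$$\max_{k,\ell \colon k \neq \ell} \mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) + \frac{\sqrt{11}}{\sqrt{3\tilde{c}}} \frac{\sqrt{d_{\max}}}{\sqrt{p}} < \frac{1}{15 \log N}, \qquad (7)$$

where $d_{\max} = \max_\ell d_\ell$ and $\tilde{c}$ is the constant in the concentration inequality (4), then the graph
$G$ obtained by applying TSC to $\mathcal{X}$ has no false connections with probability at least $1 - 7N^{-2} -
\sum_{\ell=1}^{L} n_\ell e^{-c(n_\ell - 1)}$, where $c > 1/20$ is a numerical constant.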


Our main result for SSC is the following. 


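Theorem 2. Let $\rho_\ell := (n_\ell - 1)/d_\ell$, $\ell \in [L]$, $\rho_{\min} := \min_\ell \rho_\ell > \rho_0$, where $\rho_0 > 1$ is a numerical
constant, and pick any $\tau > 0$. Set $d_{\max} = \max_\ell d_\ell$ and suppose that

$$\max_{k,\ell \colon k \neq \ell} \mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) + \sqrt{\frac{28 d_{\max} + 8 \log L + 2\tau}{3 \tilde{c} p}} < \frac{\sqrt{\log \rho_{\min}}}{65 \log N}, \qquad (8)$$

where $\tilde{c}$ is the constant in (4). Then, the graph $G$ obtained by applying SSC to $\mathcal{X}$ has no false
connections with probability at least $1 - 4e^{-\tau/2} - N^{-1} - \sum_{\ell=1}^{L} n_\ell e^{-\sqrt{\rho_\ell}\, d_\ell}$.

1 A random variable $x$ is subgaussian [10, Sec. 7.4] if its tail probability satisfies $\mathrm{P}[|x| > t] \leq c_1 e^{-c_2 t^2}$ for constants
$c_1, c_2 > 0$.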

Finally, for SSC-OMP we obtain the following statement.


Theorem 3. Let $\rho_\ell := (n_\ell - 1)/d_\ell$, $\ell \in [L]$, $\rho_{\min} := \min_\ell \rho_\ell > \rho_0$, where $\rho_0 > 1$ is a numerical
constant, and pick any $\tau > 0$. Set $d_{\min} := \min_\ell d_\ell$, $d_{\max} := \max_\ell d_\ell$, and suppose that $\Phi$ has (in
addition to satisfying the concentration inequality (4)) a rotationally invariant distribution, i.e.,
$\Phi \mathbf{V} \sim \Phi$ for all unitary matrices $\mathbf{V} \in \mathbb{R}^{m \times m}$. If

$$\max_{k,\ell \colon k \neq \ell} \mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) + \sqrt{\frac{28 d_{\max} + 8 \log L + 2\tau}{12 \tilde{c} p}} \sqrt{\frac{d_{\max}}{d_{\min}}} < \frac{3 \sqrt{\log \rho_{\min}}}{200 \log N}, \qquad (9)$$

where $\tilde{c}$ is the constant in (4), then, irrespectively of the choice of the maximum number of OMP
iterations $s_{\max}$,² the graph $G$ obtained by applying SSC-OMP to $\mathcal{X}$ has no false connections with
probability at least $1 - 4e^{-\tau/2} - 4N^{-1} - \sum_{\ell=1}^{L} n_\ell e^{-\sqrt{\rho_\ell}\, d_\ell}$.


The proofs of Theorems 1, 2, and 3 are provided in the appendices and
are established by first deriving deterministic clustering conditions that are then evaluated for our
statistical data model.

Theorems 1 and 2 essentially say that even when $p$ is on the order of $d_{\max}$, TSC and SSC
succeed with high probability if the affinities between the subspaces $\mathcal{S}_\ell$ are sufficiently small and if
$\mathcal{X}$ contains sufficiently many points from each subspace. The same conclusion applies to SSC-OMP
provided that the term $\sqrt{d_{\max}/d_{\min}}$ is not too large, which is the case if the dimensions $d_\ell$, $\ell \in [L]$,
of the subspaces are of the same order. This condition is satisfied in many practical applications,
e.g., for the face clustering and the handwritten digit clustering problems described in
Section 5. We believe the occurrence of the factor $\sqrt{d_{\max}/d_{\min}}$ in (9) to be an artifact of our proof
technique. Also note that Theorem 3 imposes more restrictive conditions on $\Phi$ than Theorems 1
and 2, namely the distribution of $\Phi$ has to be rotationally invariant. This is a technical condition
and it is not implied by (4). Examples of rotationally invariant matrices satisfying (4) include
matrices with i.i.d. $\mathcal{N}(0, 1/p)$ entries.

Theorems 1-3 apply even when the subspaces $\mathcal{S}_\ell$ span the ambient space $\mathbb{R}^m$. This follows by
virtue of our clustering conditions depending only on the pairwise affinities between subspaces, and
pairwise affinities changing only moderately if the dimensionality is reduced down to no more than
the order of the individual subspace dimensions.

Theorems 1-3 show that for all three algorithms, $p$ may be taken to be linear (up to log-factors)
in $d_{\max}$. We can therefore conclude that the dimensionality of the data set $\mathcal{Y}$ can be reduced
down to the order of the largest subspace dimension without affecting clustering performance
significantly. This has important practical ramifications as, for all three algorithms considered, the
computational complexity associated with the construction of the adjacency matrix is essentially
linear in the dimension of the ambient space the data points "live in". To get an idea of the
resulting overall complexity savings, let us consider the TSC algorithm and assume that the (high-dimensional)
data set $\mathcal{Y} \subset \mathbb{R}^m$ is projected down to $\mathbb{R}^p$, with $p = O(d_{\max} \log^2(N))$, via a Gaussian
random projection; this choice of $p$ guarantees, by Theorem 1, that clustering performance is not
affected significantly by dimensionality reduction. The complexity associated with the construction
of the adjacency matrix for TSC is given by the cost of computing the inner products between all
pairs of data points, and is therefore $O(mN^2)$ for the original data set $\mathcal{Y} \subset \mathbb{R}^m$ and $O(pN^2)$ for
the projected data set $\mathcal{X} \subset \mathbb{R}^p$. Adding the cost for applying the Gaussian random projection
results in an overall cost of $O(pN^2) + O(pNm) = O(d_{\max} \log^2(N)\, N(N + m))$ for building the
adjacency matrix associated with $\mathcal{X}$. The resulting complexity savings for TSC are therefore given
by $O(\min(m, N)/(d_{\max} \log^2(N)))$. The absolute run-time savings are even more pronounced for
SSC-OMP and SSC, as the corresponding costs for building the adjacency matrix are larger than
$O(mN^2)$. Further gains can be obtained by employing fast random projections [1].

2 While the statement holds irrespectively of $s_{\max}$, recall from Section 2 that choosing $s_{\max}$ too small may result
in too few non-zeros in the adjacency matrix $\mathbf{A}$ for successful clustering.
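
The order-of-magnitude savings for TSC can be made concrete with a back-of-the-envelope calculation. The sketch below sets all constants hidden in the $O(\cdot)$ terms to 1 and uses made-up problem sizes, so the numbers only indicate scaling, not actual run times; the function name `tsc_adjacency_savings` is likewise an assumption.

```python
import math

def tsc_adjacency_savings(m, N, d_max):
    """Operation counts for building the TSC adjacency matrix, with all O(.) constants set to 1."""
    p = math.ceil(d_max * math.log(N) ** 2)       # p = O(d_max log^2 N), natural logarithm
    cost_original = m * N ** 2                    # inner products between all pairs in R^m
    cost_projected = p * N ** 2 + p * N * m       # pairs in R^p plus applying a Gaussian Phi
    return p, cost_original / cost_projected

m, N, d_max = 100_000, 2_000, 5                   # hypothetical problem sizes
p, savings = tsc_adjacency_savings(m, N, d_max)
rule_of_thumb = min(m, N) / (d_max * math.log(N) ** 2)
print("p =", p, "savings factor =", round(savings, 2), "min(m, N)/(d_max log^2 N) =", round(rule_of_thumb, 2))
```

The exact ratio $mN/(p(N+m))$ and the simplified expression $\min(m, N)/(d_{\max} \log^2 N)$ agree up to a factor of at most two.
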

Dimensionality reduction affects the computational cost associated with the construction of the
adjacency matrix only. The spectral clustering step, which when naively implemented has complexity
$O(N^3)$, may be the dominating factor in the overall computational cost, in particular when
$m$ is small relative to $N$.³ Notwithstanding, dimensionality reduction can still lead to significant
total run-time savings. Our numerical results in Section 5 demonstrate this for SSC. To see savings
on the same order for SSC-OMP and TSC, we would have to consider problems with $N$ smaller
relative to $m$.

The probability lower bounds in Theorems 1-3 are independent of $p$ and $m$ and require the
total number of data points $N$ to be large in absolute terms in order to ensure a success probability
close to one.

Theorems 1-3 are order-optimal in the following sense. If dimensionality is reduced to below
$d_{\max}$, then, in general, there are points from different subspaces that are projected into the same
lower-dimensional subspace, which renders the resulting clustering problem fundamentally ill-posed.
To see this, take $d_\ell = d$, for all $\ell$, and assume that $p < d$. Next, note that the (randomly projected)
points $\mathbf{x}_j^{(\ell)}$ lie in the column span of $\Phi \mathbf{U}^{(\ell)}$. As $\mathbf{U}^{(\ell)}$ is a basis for the $d_\ell$-dimensional subspace
$\mathcal{S}_\ell \subset \mathbb{R}^m$, the span of $\Phi \mathbf{U}^{(\ell)}$ is $\mathbb{R}^p$, for all $\ell$, and therefore all points in the projected data set
$\mathcal{X} = \mathcal{X}_1 \cup \ldots \cup \mathcal{X}_L$ lie in the same $p$-dimensional subspace, which renders the clustering problem
ill-posed.
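
This collapse is easy to observe numerically: for $p < d$, the projected basis $\Phi \mathbf{U}^{(\ell)}$ of every subspace has rank $p$, so each projected subspace fills all of $\mathbb{R}^p$. The sketch below uses a Gaussian $\Phi$ and randomly drawn bases; both choices, as well as the sizes, are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
m, d, p, L = 50, 5, 3, 4                  # assumed sizes with p < d

Phi = rng.standard_normal((p, m)) / np.sqrt(p)   # one realization, shared by all subspaces

ranks = []
for l in range(L):
    # Random orthonormal basis U^(l) of a d-dimensional subspace of R^m.
    U, _ = np.linalg.qr(rng.standard_normal((m, d)))
    # The projected points of subspace l lie in the column span of Phi @ U.
    ranks.append(int(np.linalg.matrix_rank(Phi @ U)))

print("rank of Phi U^(l) for each subspace:", ranks)
```

Every entry equals `p`, i.e., all projected subspaces coincide with $\mathbb{R}^p$ and can no longer be told apart.
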

We next compare the clustering conditions (7), (8), and (9) in Theorems 1, 2, and 3 with their
counterparts for clustering of the original, high-dimensional data set $\mathcal{Y}$. Specifically, such reference
conditions can be found in [16, Thm. 2] for TSC and in [28, Thm. 2.8] for SSC, but do not seem to
be available for SSC-OMP for the statistical data model considered in this paper. However, setting
$\Phi = \mathbf{I}$ in the proof of Theorem 3, we can easily get a reference condition for SSC-OMP. Rather than
providing the details of this simple modification, we refer the reader to the proof in [31, Chap. 4].

Corollary 1. Let $\rho_\ell := (n_\ell - 1)/d_\ell$, $\ell \in [L]$, and suppose that $\rho_{\min} := \min_\ell \rho_\ell > \rho_0$, where $\rho_0 > 1$
is a numerical constant. If

\[ \max_{k,\ell \colon k \neq \ell} \mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) < \frac{\sqrt{\log \rho_{\min}}}{64 \log N} \tag{10} \]

then the graph $G$ obtained by applying SSC-OMP to the original, high-dimensional data set $\mathcal{Y}$ has
no false connections with probability at least $1 - 2N^{-1} - \sum_{\ell=1}^{L}$

We conclude that for all three algorithms the impact of dimensionality reduction is essentially
quantified through a term proportional to $\sqrt{d_{\max}/p}$ that adds to the maximum affinity between
the subspaces $\mathcal{S}_\ell$ in the clustering conditions (7), (8), and (9). These clustering conditions nicely
reflect the intuition that the smaller the affinities between the subspaces $\mathcal{S}_\ell$, the more aggressively
we can reduce the dimensionality of the data set without compromising clustering performance.

As the result in Corollary 1 is new, a few comments on its relation to existing results, specifically
those in [7] and [37], are in order. Corollary 1 imposes less restrictive conditions on the relative
orientations of the subspaces than [7, Thm. 3], [37, Thm. 2, Cor. 1], but makes stronger assumptions
on the data model. The result in [37, Thm. 3] applies to subspaces with random orientations, and
therefore does not allow for statements involving subspace affinities. We refer the reader to the
thesis [31, Sec. 4.1] for a more detailed comparison of Corollary 1 above to [7, Thm. 3]. Finally,
numerical results corroborating the fundamental nature of the clustering condition (10) can be
found in [31, Sec. 5.1].


4 Impact of noise 


In many practical applications the data points to be clustered are corrupted by noise, typically
modeled as additive Gaussian noise. In this section, we study the interplay between dimensionality
reduction and additive noise for the TSC algorithm. Specifically, we let the high-dimensional data
points be corrupted by Gaussian noise according to

\[ \hat{y}_j^{(\ell)} = y_j^{(\ell)} + e_j^{(\ell)} \]

where $e_j^{(\ell)} \sim \mathcal{N}(0, (\sigma^2/m) I)$, and assume, as before, that $y_j^{(\ell)}$ is drawn i.i.d. uniformly from the
intersection of the $d_\ell$-dimensional subspace $\mathcal{S}_\ell$ with the unit sphere. The dimensionality-reduced
noisy data set $\mathcal{X} \subset \mathbb{R}^p$ is obtained by applying the same realization of the random projection matrix
$\Phi \in \mathbb{R}^{p \times m}$ to all (noisy) data points $\hat{y}_j^{(\ell)}$. The elements of the sets $\mathcal{X}_\ell$ in $\mathcal{X} = \mathcal{X}_1 \cup \dots \cup \mathcal{X}_L$ are
hence given by

\[ x_j^{(\ell)} = \Phi\big(y_j^{(\ell)} + e_j^{(\ell)}\big), \quad j \in [n_\ell]. \tag{11} \]

Theorem 4. Choose $q$ such that $q < \min_\ell n_\ell/6$, and let $m > 6 \log N$. If

\[ \max_{k,\ell \colon k \neq \ell} \mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) + \frac{\sqrt{11}}{\sqrt{3\bar{c}}} \frac{\sqrt{d_{\max}}}{\sqrt{p}} + \frac{\sigma(1+\sigma)\sqrt{6}}{\sqrt{\bar{c} \log N}} \frac{\sqrt{d_{\max}}}{\sqrt{p}} < \frac{1}{15 \log N} \tag{12} \]

where $d_{\max} = \max_\ell d_\ell$ and $\bar{c} = \min(6, \tilde{c})$ with $\tilde{c}$ the constant in the concentration inequality (6),
then the graph $G$ obtained by applying TSC to $\mathcal{X}$ has no false connections with probability at least
$1 - 14N^{-1} - 2Ne^{-m} - \sum_{\ell=1}^{L} n_\ell e^{-c(n_\ell - 1)}$, where $c > 1/20$ is a numerical constant.

Theorem 4 states that in the noisy case, just as in the noiseless case, TSC succeeds for $p$ as
small as $d_{\max}$, order-wise, provided that the affinities between the subspaces $\mathcal{S}_\ell$ are sufficiently
small and $\mathcal{X}$ contains sufficiently many points from each subspace. More specifically, comparing
the noiseless clustering condition (7) to (12), we can see that the impact of noise is simply to add
the $\sigma$-dependent offset, i.e., the third term on the LHS of (12), to the LHS of the clustering condition.
For fixed $\sigma$, owing to the factor $\sqrt{d_{\max}/p}$, the impact of noise on the effective affinity as quantified
by the LHS of (12) becomes more pronounced when the dimensionality is reduced more aggressively.

Theorem0] continues to hold (with c in the term replaced by a numerical constant, 

and e~ m in the success probability replaced by e -m ), if noise &P ~ jV(0, {a 2 /p)I) is added after 


random projection according to $x_j^{(\ell)} = \Phi y_j^{(\ell)} + \tilde{e}_j^{(\ell)}$. This is not surprising, as the absolute amount
of noise injected remains the same, i.e., $\mathbb{E}\big[\|\tilde{e}_j^{(\ell)}\|_2^2\big] = \mathbb{E}\big[\|e_j^{(\ell)}\|_2^2\big] = \sigma^2$.

We finally note that an approach similar to that used for TSC can be used to extend the
result for SSC-OMP to the noisy case resulting in clustering conditions analogous to those for TSC.
The corresponding technical details are, however, significantly more involved and cumbersome. We
therefore decided not to state the formal result. We expect that a similar result can be proven for
(a robust version of) SSC as our simulation results in Section 5.1.2 indicate that the qualitative
behavior of all three algorithms in the presence of noise is essentially identical, and, in addition, is
qualitatively accurately predicted by Theorem 4.


5 Numerical Results


We evaluate the impact of dimensionality reduction on the clustering error (CE), i.e., the fraction
of misclustered points, for TSC, SSC, and SSC-OMP applied to synthetic data as well as to publicly
available standard data sets widely used in the subspace clustering literature. Specifically, we
consider the problems of clustering faces, handwritten digits, and gene expression data. All three
algorithms, TSC, SSC, and SSC-OMP, were observed to tolerate massive dimensionality reduction
in all experiments. The performance ranking of the three algorithms according to CE varies considerably
across data sets. To demonstrate that none of the algorithms uniformly
outperforms the others, we chose to report the results for all three data sets. We also compare
the algorithms in terms of their running times on a PC with 32 GB RAM and an 8-core Intel Core
i7-3770K CPU clocked at 3.50 GHz.
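
For concreteness, the CE can be computed as sketched below. This is an illustrative Python sketch rather than the Matlab code used for the experiments; it assumes that cluster labels are integers in `range(num_clusters)` and that the estimated clusters are matched to the true ones by the best label permutation, which is a common convention but is not spelled out in the text. The function name `clustering_error` is hypothetical.

```python
from itertools import permutations

def clustering_error(true_labels, est_labels, num_clusters):
    """Fraction of misclustered points under the best relabeling.

    Cluster indices returned by a clustering algorithm are arbitrary, so
    every permutation of the estimated labels is tried and the one that
    agrees with the ground truth on the most points is kept. Brute force
    is adequate for the small numbers of clusters considered here.
    """
    n = len(true_labels)
    best_matches = 0
    for perm in permutations(range(num_clusters)):
        matches = sum(1 for t, e in zip(true_labels, est_labels) if perm[e] == t)
        best_matches = max(best_matches, matches)
    return 1.0 - best_matches / n

# Same partition with swapped label names: no point is misclustered.
ce_relabelled = clustering_error([0, 0, 1, 1, 2, 2], [1, 1, 0, 0, 2, 2], 3)
# One of six points assigned to the wrong cluster.
ce_one_wrong = clustering_error([0, 0, 0, 1, 1, 1], [0, 0, 1, 1, 1, 1], 2)
```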

TSC and SSC-OMP were implemented in Matlab following the specifications in Section 2. For
SSC, we used the Matlab implementation provided in [9], which is based on Lasso (instead of
$\ell_1$-minimization) and uses the Alternating Direction Method of Multipliers (ADMM). Code to reproduce
the experiments in this section is available at http://www.nari.ee.ethz.ch/commth/research/.
Information on the number of Monte Carlo runs used in our experiments is contained in this Matlab
code.

Unless stated otherwise, we select the Lasso parameter $\lambda$ in SSC from the set $\{0.001, 0.002,
0.004, 0.008, 0.01, 0.02, 0.04, 0.08, 0.1, 0.2\}$ such that the lowest clustering error is obtained on the
original high-dimensional data set $\mathcal{Y}$. The parameters $q$ and $s_{\max}$ for TSC and SSC-OMP,
respectively, are chosen analogously from the set $\{2, 4, \dots, 18\}$. Although these parameter selection
procedures may not yield the optimum parameters for the projected data set $\mathcal{X}$ for all realizations
of $\Phi$, we desist from selecting the parameters for every realization of $\Phi$ individually as this may
lead to overly optimistic results.

As projection matrices we consider i.i.d. $\mathcal{N}(0, 1/p)$ Gaussian random matrices (referred to as
GRP) and fast random projection (FRP) matrices [1] given by the real part of $FD \in \mathbb{C}^{p \times m}$, where
$D \in \mathbb{R}^{m \times m}$ is diagonal with main diagonal elements drawn i.i.d. uniformly from $\{-1, 1\}$, and
$F \in \mathbb{C}^{p \times m}$ is obtained by choosing a set of $p$ rows uniformly at random from the rows of an $m \times m$
discrete Fourier transform (DFT) matrix. In all experiments the dimensionality-reduced data set
$\mathcal{X}$ is obtained by applying the (same) realization of either a GRP or an FRP matrix to all data
points in $\mathcal{Y}$. The FRP can be implemented efficiently by premultiplying the data points in $\mathcal{Y}$ by $D$
and then applying the FFT to each data point. With regard to storage space, we note that the FRP only requires
the storage of a binary $m$-dimensional vector (namely the diagonal entries of $D$), in contrast to $mp$
real numbers for GRPs.
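
The two projection types can be sketched as follows. This is an illustrative standard-library Python sketch, not the Matlab code referred to above. The function names are hypothetical, the DFT is left unnormalized (the text does not fix a scaling, so this is an assumption), and the DFT rows are evaluated directly at $O(pm)$ cost for readability, whereas an FFT would give the $O(m \log m)$ cost per data point.

```python
import cmath
import math
import random

def grp_matrix(p, m, rng):
    """p x m matrix with i.i.d. N(0, 1/p) entries (GRP)."""
    std = 1.0 / math.sqrt(p)
    return [[rng.gauss(0.0, std) for _ in range(m)] for _ in range(p)]

def grp_apply(Phi, x):
    """Matrix-vector product Phi @ x."""
    return [sum(a * b for a, b in zip(row, x)) for row in Phi]

def frp_draw(p, m, rng):
    """Draw the random sign vector (diagonal of D) and p DFT row indices."""
    signs = [rng.choice((-1, 1)) for _ in range(m)]
    rows = rng.sample(range(m), p)
    return signs, rows

def frp_apply(signs, rows, x):
    """Real part of F D x, with F the selected rows of the m x m DFT matrix."""
    m = len(x)
    z = [s * v for s, v in zip(signs, x)]          # D x: random sign flips
    out = []
    for k in rows:                                  # k-th DFT coefficient of D x
        acc = sum(z[n] * cmath.exp(-2j * math.pi * k * n / m) for n in range(m))
        out.append(acc.real)
    return out

rng = random.Random(1)
m, p = 256, 64
x = [1.0 / math.sqrt(m)] * m                        # unit-norm test vector
y_grp = grp_apply(grp_matrix(p, m, rng), x)
grp_energy = sum(v * v for v in y_grp)              # close to 1 on average
signs, rows = frp_draw(p, m, rng)
y_frp = frp_apply(signs, rows, x)
```

Note that the same realization (`Phi`, or `signs` and `rows`) has to be reused for all data points, as stated above.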

5.1 Synthetic data 

5.1.1 Comparison of TSC, SSC, and SSC-OMP 

We use the data model described in Section 3 with $m = 2^{15} = 32768$ and generate $L = 3$ subspaces
$\mathcal{S}_\ell$ of $\mathbb{R}^m$ of dimension $d = 20$ at random such that every pair of subspaces intersects in at least
$r$ dimensions; this implies $\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) \ge \sqrt{r/d}$, for all $k, \ell \in [L]$, $k \neq \ell$. More specifically, we take
the basis matrices to be given by $U^{(\ell)} = [U \;\, \tilde{U}^{(\ell)}]$, where $U \in \mathbb{R}^{m \times r}$ and the $\tilde{U}^{(\ell)} \in \mathbb{R}^{m \times (d-r)}$,
$\ell \in [L]$, are chosen uniformly at random among all orthonormal matrices of dimensions $m \times r$ and
$m \times (d - r)$, respectively. We sample $n_\ell = 80$ data points, for each $\ell \in [L]$, resulting in a total of
$N = 240$ data points.
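
The construction above can be imitated on a small scale to check the affinity bound. The sketch below is illustrative only: it assumes the affinity is $\|U^{(k)T} U^{(\ell)}\|_F / \sqrt{\min(d_k, d_\ell)}$, uses much smaller dimensions than the experiment, and orthonormalizes the shared and private directions jointly by Gram-Schmidt (a simplification of drawing them uniformly at random). All function names are hypothetical.

```python
import math
import random

def gram_schmidt(vectors):
    """Orthonormalize a list of vectors (modified Gram-Schmidt)."""
    basis = []
    for v in vectors:
        w = list(v)
        for b in basis:
            coef = sum(x * y for x, y in zip(w, b))
            w = [x - coef * y for x, y in zip(w, b)]
        norm = math.sqrt(sum(x * x for x in w))
        basis.append([x / norm for x in w])
    return basis

def affinity(Uk, Ul):
    """||Uk^T Ul||_F / sqrt(min(dk, dl)) for bases given as lists of vectors."""
    fro_sq = sum(sum(a * b for a, b in zip(u, v)) ** 2 for u in Uk for v in Ul)
    return math.sqrt(fro_sq / min(len(Uk), len(Ul)))

def shared_subspaces(m, d, r, L, rng):
    """L bases of dimension d in R^m that all contain the same r directions."""
    shared = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(r)]
    bases = []
    for _ in range(L):
        private = [[rng.gauss(0.0, 1.0) for _ in range(m)] for _ in range(d - r)]
        # The shared vectors come first, so their orthonormalized versions
        # are identical in every basis.
        bases.append(gram_schmidt(shared + private))
    return bases

m, d, r, L = 40, 6, 2, 3                 # illustrative sizes, not the paper's
bases = shared_subspaces(m, d, r, L, random.Random(0))
affinities = [affinity(bases[k], bases[l])
              for k in range(L) for l in range(k + 1, L)]
lower_bound = math.sqrt(r / d)
```

Every entry of `affinities` is at least `lower_bound`, because the $r$ shared orthonormal directions contribute an $r \times r$ identity block to $U^{(k)T} U^{(\ell)}$.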

In Figure 1 we plot the CE as a function of $p$ for TSC, SSC, and SSC-OMP applied to the
dimensionality-reduced data set $\mathcal{X}$ with $r = 4$ and $r = 8$. Figure 2 shows the running times
corresponding to the application of the FRP and the GRP matrix to the (entire) data set $\mathcal{Y}$ along
with the running times of the clustering algorithms alone.

Figure 1: Clustering error for synthetic data as a function of $p$ using GRP (left) and FRP (right).
Recall that for $r = 4$ and $r = 8$ we have $\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) \ge \sqrt{1/5}$ and $\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) \ge \sqrt{2/5}$, respectively,
for all $k, \ell \in [L]$, $k \neq \ell$.

Figure 2: Running times (in seconds) for clustering synthetic data.

The results show, as predicted by Theorems 1-3, that TSC, SSC, and SSC-OMP, indeed, succeed
provided that $\sqrt{d/p}$ is sufficiently small. Specifically, we observe a transition to CE $\approx 0$ for $p$
between 20 and 100. As the subspaces $\mathcal{S}_\ell$ are of dimension 20 this corroborates the fact that
the dimensionality of the data can be reduced down to the dimension of the subspaces without
compromising clustering performance significantly. Equivalently, we accomplish a dimensionality
reduction by a factor of about 1600 (for $p = 20$) to 320 (for $p = 100$).
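
The quoted reduction factors follow directly from $m = 2^{15}$ and the two ends of the observed transition range; a short illustrative check (variable names are ours):

```python
m = 2 ** 15            # ambient dimension of the synthetic data
d = 20                 # subspace dimension
transition_range = (20, 100)

# Ratio of ambient to projected dimension at both ends of the range.
reduction_factors = {p: m / p for p in transition_range}

for p, factor in reduction_factors.items():
    print(f"p = {p:3d}: m/p = {factor:7.1f}, p/d = {p / d:.0f}")
```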

For all three algorithms the numerical results further confirm the tradeoff between the affinities
of the $\mathcal{S}_\ell$ and the amount of dimensionality reduction possible as quantified by the clustering
conditions (7), (8), and (9). Specifically, the CE increases as $r$ and hence $\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell)$ increases. In
this example, SSC consistently outperforms TSC and SSC-OMP, albeit at the cost of significantly
longer running time (see Figure 2). While the running time of SSC exhibits very pronounced
increasing behavior in $p$, that of SSC-OMP shows much less pronounced increases, and that of TSC
does not increase notably in $p$. It is furthermore interesting to see that the clustering performance
is essentially identical for FRP and GRP. This is remarkable as the application of FRP requires
only $O(m \log m)$ operations (per data point) and therefore its running time does not depend on $p$.
Application of the GRP, in contrast, requires $O(mp)$ operations (per data point), which results in
a running time that is linear in $p$.

Figure 3: CE (color coded) as a function of $\sqrt{d/p}$ and $\sigma$ for $L = 2$ orthogonal subspaces of $\mathbb{R}^{100}$
(panels: TSC, SSC, SSC-OMP; horizontal axis: $\sqrt{d/p}$).
The black lines correspond to the curve $\sqrt{d/p}\,(0.8 + \sigma(0.1 + \sigma)) = 0.8$, and roughly separate the
regimes where clustering succeeds from that where it fails.

5.1.2 Impact of noise 

In the next experiment we study the interplay between noise and dimensionality reduction. We use
the data model described in Section 3 with $m = 100$ and generate $L = 2$ orthogonal subspaces $\mathcal{S}_\ell$ of
$\mathbb{R}^m$ of dimension $d = 10$. This ensures that the affinity between the subspaces equals 0 (fixing the
affinity to some other constant would not change the qualitative conclusions). We generate the noisy
data set $\mathcal{Y}$ by sampling $n_\ell = 30$ points from each of the two subspaces and adding $\mathcal{N}(0, (\sigma^2/m) I)$
noise. Figure 3 shows the CE as a function of $\sqrt{d/p}$ and $\sigma$ for dimensionality reduction via GRP.

The clustering condition in Theorem 4 guarantees that TSC succeeds as long as
$\sqrt{d/p}\,(c_1 + \sigma(c_2 + \sigma)) < c_3$, where $c_1, c_2, c_3$ are independent of $d$, $p$, $m$, and $\sigma^2$. In order to find out whether
this sufficient condition predicts the fundamental clustering behavior qualitatively correctly, we
test whether a phase transition, separating the region where clustering succeeds from that where
it fails, indeed, occurs at

\[ \sqrt{d/p}\,\big(c_1 + \sigma(c_2 + \sigma)\big) = c_3. \tag{13} \]

To this end, we fit (13), by choosing $c_1, c_2, c_3$, into the plots in Figure 3 and observe that the
answer is in the affirmative. Moreover, our numerical results show that the phase transition behavior
of SSC and SSC-OMP is essentially identical to that of TSC, which provides evidence for SSC and
SSC-OMP behaving similarly to TSC in the noisy case.
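
The fitted phase-transition curve is easy to evaluate. The sketch below uses the constants quoted in the caption of Figure 3 ($c_1 = 0.8$, $c_2 = 0.1$, $c_3 = 0.8$); the function names are hypothetical and the resulting values of $p$ are predictions of the fitted curve, not guarantees.

```python
C1, C2, C3 = 0.8, 0.1, 0.8   # constants of the black curve in Figure 3

def boundary_sqrt_d_over_p(sigma, c1=C1, c2=C2, c3=C3):
    """Value of sqrt(d/p) on the curve sqrt(d/p) * (c1 + sigma*(c2 + sigma)) = c3."""
    return c3 / (c1 + sigma * (c2 + sigma))

def predicted_min_p(d, sigma, c1=C1, c2=C2, c3=C3):
    """Projected dimension p at which the fitted curve is crossed for given d, sigma."""
    return d / boundary_sqrt_d_over_p(sigma, c1, c2, c3) ** 2

# For d = 10 (as in this experiment), tabulate the predicted transition point.
table = {sigma: predicted_min_p(10, sigma) for sigma in (0.0, 0.5, 1.0)}
```

For $\sigma = 0$ the curve sits at $\sqrt{d/p} = 1$, i.e., $p = d$, and the predicted transition moves to larger $p$ as $\sigma$ grows.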
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5.1.3 Dimensionality reduction when the subspaces span the ambient space 

As noted in Section 2, Theorems 1-3 indicate that dimensionality reduction down to the order
of the subspace dimensions is possible even when the subspaces $\mathcal{S}_\ell$ span the ambient space $\mathbb{R}^m$.
To verify this observation empirically, we perform the following experiment. We draw a random
Gaussian matrix $V \in \mathbb{R}^{200 \times 200}$. With probability one, the columns of $V$ span $\mathbb{R}^{200}$. We then
extract the $200 \times 20$ matrices $V^{(\ell)}$ from $V$ according to $[V^{(1)} \dots V^{(10)}] = V$, and let the subspace
$\mathcal{S}_\ell$ be given by the span of $V^{(\ell)}$, $\ell = 1, \dots, 10$. This guarantees that the union of the $\mathcal{S}_\ell$ spans $\mathbb{R}^{200}$.
Note, however, that the affinities between pairs of the resulting subspaces will be small with high
probability. We again use the data model described in Section 3 and sample $n_\ell = 60$ points on
$\mathcal{S}_\ell \cap \mathbb{S}^{m-1}$, for all $\ell \in [L]$, to obtain a data set $\mathcal{Y}$ with a total of $N = 600$ points. We select the
values for $q$, $\lambda$, and $s_{\max}$ that yield the lowest CE for the majority of values for $p$.

Figure 4 shows the CE as a function of $p$ for TSC, SSC, and SSC-OMP. The CE starts to be
non-zero for $p < 60$ for TSC and SSC-OMP, and for $p < 40$ for SSC. We therefore conclude that
the dimensionality can, indeed, be reduced, quite significantly, even when the subspaces span the
ambient space, as indicated by Theorems 1-3.



Figure 4: CE as a function of $p$ for $L = 10$ subspaces that collectively span the ambient space $\mathbb{R}^{200}$.


5.2 Clustering faces 

We next evaluate the impact of dimensionality reduction in the problem of clustering face images 
taken from the Extended Yale B data set mm, which contains 192 x 168 pixel (m = 32256) 
frontal face images of 38 individuals, with 64 images per individual, each acquired under different 
illumination conditions. The motivation for applying subspace clustering algorithms to this problem 
stems from the insight that the vectorized images of a given face taken under varying illumination 
conditions lie approximately in a 9-dimensional linear subspace [3]. Each 9-dimensional subspace 
Si would then contain the images corresponding to a given person. 

We generate $\mathcal{Y}$ by first selecting a subset of $L = 2$ individuals uniformly at random from the
$\binom{38}{2}$ pairs and then collecting all images corresponding to the two selected individuals. In
Figure 5 we plot the corresponding CE and the running times as a function of $p$. Again, for each $p$,
the CE and the running times are obtained by averaging over 500 problem instances generated by
randomly choosing 100 instances of $\mathcal{Y}$ and 5 realizations of the projection matrix per chosen data
set $\mathcal{Y}$. In contrast to the preceding experiment, here, SSC-OMP consistently outperforms TSC and
SSC. For all three algorithms the dimensionality of the data can be reduced by a factor of about
100 without notably increasing the CE. Note, however, that in this experiment the dimensionality
cannot be reduced as aggressively as in the preceding synthetic data experiment. Specifically, here
the data points lie in 9-dimensional subspaces and dimensionality reduction by a factor of 100
corresponds to p ~ 322. One possible explanation for this observation is that the principal angles 
between the subspaces spanned by the face images of different subjects are typically small (see 
m Sec. 7]), which means that the subspace affinities in this data set are large. The conclusions 
regarding running times and choice of the random projection matrix are analogous to those reported 
for synthetic data above. 




Figure 5: Clustering error and running times (in seconds) for clustering L = 2 faces from the 
Extended Yale B data set. 


5.3 Clustering handwritten digits 

In this experiment, we investigate the impact of dimensionality reduction in the context of clustering
images of handwritten digits. We use the MNIST data set [23] containing 10,000 images
of (horizontally and vertically) aligned handwritten digits of size $28 \times 28$ pixels ($m = 784$). The
motivation for employing subspace clustering in this context stems from the observation that vectorized
images of different handwritten versions of the same digit tend to lie near a low-dimensional
subspace [T41j . 

We generate the data sets $\mathcal{Y}$ by selecting 250 images (out of 1000) uniformly at random from
each of the sets corresponding to the digits 2, 4, and 8. There is no specific reason for our choice
of the digits 2, 4, and 8; other combinations of three digits yield similar results. However, some
combinations of digits are more difficult to cluster than others, e.g., 1 and 7 are “closer” (in terms
of the affinities between the subspaces the corresponding images approximately lie in) than 1 and
8; clustering 1 and 7 therefore typically results in a larger error than clustering 1 and 8. The
results depicted in Figure 6 show that the dimensionality of the data set can be reduced from
$m = 784$ to $p = 200$, i.e., by a factor of 3.9, without notably increasing the CE incurred by TSC
and SSC. For sufficiently large $p$, TSC yields a slightly lower clustering error than SSC. SSC-OMP
is outperformed considerably by the other two algorithms.

Figure 6: Clustering error for handwritten digits 2, 4, and 8 from the MNIST data set.

5.4 Clustering gene expression data 

Finally, we consider clustering of gene expression level data, originating from different types of
cancer cells, according to cancer type. This problem is of significant practical relevance as it helps,
inter alia, to identify genes that are involved in the same cellular process [20]. The use of subspace
clustering in this context was suggested in [20]. We use the publicly available Novartis multi-tissue
data set from the Broad Institute Cancer Program database [5]. This data set contains the
1000-dimensional gene expression level data of $n = 103$ tissue samples taken from $L = 4$ different cancer
types. In order to illustrate that the gene expression level vectors of a single cancer type, indeed,
lie near a low-dimensional subspace, we plot, in Figure 7, the singular values of the data matrices
corresponding to a single cancer type. We observe that the singular values decay rapidly and for
every cancer type, more than 94% of the energy of the corresponding data vectors is concentrated
in a 6-dimensional subspace of the 1000-dimensional ambient space.
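
The energy statement can be read off the singular values as sketched below. This is an illustrative sketch: it assumes that "energy" refers to the sum of squared singular values, so that the fraction captured by the best $k$-dimensional subspace is the ratio of the top-$k$ squared singular values to the total. The singular values in the example are made up and are not those of the Novartis data set.

```python
def energy_fraction(singular_values, k):
    """Fraction of the total squared singular values captured by the top k."""
    squared = sorted((s * s for s in singular_values), reverse=True)
    return sum(squared[:k]) / sum(squared)

# Hypothetical, rapidly decaying singular values of one per-class data matrix.
example_singular_values = [10.0, 5.0, 2.0, 1.0, 0.5, 0.2, 0.1, 0.1]
example_fraction = energy_fraction(example_singular_values, 6)
```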

We cluster all $n = 103$ available samples. The CE obtained by averaging, for each $p$, over 200
realizations of the random projection matrix is shown in Figure 7. For $p \approx 100$, which corresponds
to dimensionality reduction by a factor of 10, the CEs of TSC and SSC are comparable to those
obtained when operating on the original high-dimensional data set. SSC is seen to consistently
(across $p$) perform best, followed by TSC and SSC-OMP. As in previous experiments the CEs
observed for GRP and FRP, for each of the three algorithms, are virtually identical.


A Proof of Theorems 1 and 4

The proof idea for Theorem 1 is to turn the effect of the random projection into an additive
perturbation and to show that this perturbation is small for all values of $p$ down to the order of
$d_{\max}$. In the noisy case, addressed by Theorem 4, we have an additional perturbation due to noise.
We detail the proof of the more general Theorem 4 below, and explain in Appendix A.3 the simple
changes that yield Theorem 1. The proof of Theorem 4 follows closely that of [16, Thm. 3], which
quantifies the performance of TSC under additive Gaussian noise alone. We therefore elaborate
only on the steps that are new relative to [16] and encourage the interested reader to consult [16]
for the arguments not repeated here.

Figure 7: Clustering error for gene expression level data of $L = 4$ cancer types (left). Singular
values of data matrices corresponding to a single cancer type (right).

The graph $G$ obtained by applying TSC to the dimensionality-reduced noisy data set $\mathcal{X}$ has no
false connections, i.e., each $x_i^{(\ell)}$ is connected to points in $\mathcal{X}_\ell$ only, if for each $x_i^{(\ell)} \in \mathcal{X}_\ell$ the associated
set $\mathcal{T}_i$ corresponds to points in $\mathcal{X}_\ell$ only, for all $\ell$. This is the case if

\[ z^{(\ell)}_{(n_\ell - q)} > \max_{k \neq \ell,\, j} z^{(k)}_j \tag{14} \]

where $z_j^{(k)} := |\langle x_j^{(k)}, x_i^{(\ell)} \rangle|$ and $z^{(\ell)}_{(1)} \le z^{(\ell)}_{(2)} \le \dots \le z^{(\ell)}_{(n_\ell - 1)}$ are the order statistics of $\{z_j^{(\ell)}\}_{j \in [n_\ell] \setminus \{i\}}$
and $\max_{k \neq \ell, j}$ denotes maximization over $k \in [L]$, $k \neq \ell$, and over the indices $j$ of the corresponding
points $x_j^{(k)} \in \mathcal{X}_k$. Note that, for simplicity of exposition, the notation $z_j^{(k)}$ does not reflect dependence
on $x_i^{(\ell)}$. The proof is established by upper-bounding the probability of (14) being violated for
a given data point $x_i^{(\ell)}$. A union bound over all $N$ points $x_i^{(\ell)}$, $i \in [n_\ell]$, $\ell \in [L]$, then yields the final
result. We start by setting z^ := (yp\yP^ , where y P = are the original data points 

in the (high-dimensional) space R m , and noting that zp = /x^, Stp 
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Here, the term accounts for the perturbation caused by random projection, and corresponds 
to the perturbation caused by noise. The probability of (14) being violated can now be
upper-bounded according to

We refer the reader to [16, Proof of Thm. 3, Eq. (40)] for an explanation of the steps leading to (16)
(while [16, Eq. (40)] is not completely equivalent to (16), the steps leading to (16) are essentially
identical). Resolving the assumption (18) leads to

which is implied by (12) (using that $\sqrt{28 d_{\max} + 8 \log L + 8 \log N}/\sqrt{\log N} \le \sqrt{44 d_{\max}}$ because
$\log L/\log N \le 1$, $d_{\max} \ge 1$, and $\log N \ge 1$ for $N \ge 3$). With $\epsilon$ as defined in (17), and the triangle
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inequality, it follows that max^j.^^ | e- | > e implies that either ma *-(j,k)^(i,£) \ e 
|e^| > S', or both. Therefore, by a union bound argument 
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Here, the first and second term on the RHS of (fT9l) correspond to the perturbation caused by 
random projection and by noise, respectively. As established in Sections A.1 and A.2 these terms
can be upper-bounded by $4/N^2$ and $2e^{-m} + 7/N^2$, respectively, which yields

(20)

The remaining terms on the RHS of (16) are upper-bounded as shown in Steps 3 and 2 in [16,
Proof of Thm. 3], respectively, using standard concentration of measure results, according to

where $c > 1/20$ is a numerical constant, and we employed the assumption $n_\ell > 6q$, for all $\ell$.

With (20) we thus get that (14) is violated with probability at most $e^{-c(n_\ell - 1)} + 2e^{-m} + 14N^{-2}$.
Taking the union bound over all points $x_i^{(\ell)}$, $i \in [n_\ell]$, $\ell \in [L]$, finishes the proof.


A.1 Perturbation caused by random projection

We next show that the first term on the RHS of (19) is upper-bounded by $4/N^2$. This term
corresponds to the perturbation caused by random projection. For notational convenience, we set
B m = U^ T ($ t $ - I)UW and note that 
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where m follows from 


B Ma 
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from (|45|) in Appendix [B] with r = 
d = dg, and /3 = y/6 log N. 


< ||Bfc ^|| 2 (1221) is by the union bound, and (1231) follows 
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Proposition 1 (E.g., |33l Ex. 5.25]). Let a be uniformly distributed on§ d 1 and fix b e W l . Then, 
for $\beta \ge 0$, we have

\[ \mathrm{P}\!\left[ |\langle a, b \rangle| > \frac{\beta}{\sqrt{d}} \|b\|_2 \right] \le 2 e^{-\beta^2/2}. \]


A.2 Perturbation caused by noise 

In this section, we deal with the perturbation caused by noise. Specifically, we establish that the
second term on the RHS of (19) satisfies


max 


j(*)| > ptEdL±Tig 


m 




(24) 


• • rji (fo') ... . (,£') 

For notational convenience, we set y) = $ 3>y )• and drop the indices i and l to write y = y] , 
y = y , e = e . We first note that 


max 


;( fc ) I 


> +flA 

V 7 ™ J 


u 


~( k ) I 


ym J 


C 


U 

U> k )¥=M 


#) 


>P 



U 







(*) 




m 


(25) 



Here, (1251) follows from the triangle inequality. To verify (1261) . consider the first event in (1251) and 
note that 


_(fc) 

y) ,e 


To see this, simply take the complement of (f2TT) according to 


_(fc) 

y) > e 


>P 


a 




(27) 


111^112^2 <«'}n 

where we used 

"-(fc) 

yj 


(fc) 


$ T $y?° 


</? 


(7 


< ||$ T $| 




2—>2 




,(*) 


= ||$ T $| 


2 —>2' 
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Treating the second and the third event in (12511 similarly establishes (12611 . A union bound argument 
now yields 


max |ej fc) | > P ^^^ F 


m 


< P [||$ t $|| 2 > 5' 


y- be 


>p 


a 


+ E p 

U> k )¥=M 

+ E p [!(^-y)|>^llyll 





2 


+ E p 
+ E p 




> p % 

$ T $e (fc ) 


yjm 

0 

2 


,(*) 


> 2 u 


<2e m + 6Ne 2 + Ne 2 . 


(28) 

(29) 

(30) 

(31) 

(32) 

(33) 


To get (33) we upper-bounded the terms on the RHSs of (28)-(32) as follows. For the RHS of (28)
we note that

\[ \mathrm{P}\big[\|\Phi^T \Phi\|_{2 \to 2} \ge \delta'\big] \le 2e^{-m}, \]

which is a consequence of Theorem 6 stated in Appendix B below. Specifically, with 1 <
which follows from c = min( 6 , c) < 6 and p < rn. both by assumption, we have 



P 




2-S-2 — 



< P 


< P 


cjp $ > 1 + 

112^2 — ~ 


$ - I > 

A 112—^2 — 



< 2 e" m 


(34) 

(35) 


where (1351) is by Theorem [ 6 ] (with t = \J2rn). To establish (1541) . first note that ||<I > 7 $ J-II 2—>2 — ^ 
(with 5' = J^) implies cr max ($ i 3>) < 1 + 5', which in turn is equivalent to 11<3 > 7 3 >|| 2 _^ 2 — 1 + $ ■ 

We can therefore conclude that ||<I> T <&|| 9 > 1 + 5' implies ||3 » 7 $ ^ 112—>2 — ^ ’ 

The terms inside the sums on the RHSs of (1591) . (1301) . and (13T1) . were upper-bounded by ap¬ 
plying Lemma [U stated below. Specifically, we note that (y^,e) ~ A/"(0,cr 2 ||y^|| 2 ), (e^,y) ~ 
A7(0, a 2 ||y || 2 ), and (& T &e^ k \e) ~ AA(0, cr 2 1 1^e^|| 2 ), where y^, y, and respectively, 

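can be regarded as fixed, and we used $\beta = \sqrt{6 \log N} > \frac{1}{\sqrt{2\pi}}$, as $N > 1$.

Lemma 1 ([22, Prop. 19.4.2]). Let $x \sim \mathcal{N}(0, 1)$. For $\beta > \frac{1}{\sqrt{2\pi}}$, we have

\[ \mathrm{P}[x > \beta] \le e^{-\beta^2/2}. \tag{36} \]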
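
The tail bound of Lemma 1 can be checked numerically against the exact Gaussian tail probability, which is available through the complementary error function. The following is a small illustrative sketch; the helper names and the chosen test values of $\beta$ are ours.

```python
import math

def gaussian_tail(beta):
    """Exact P[x > beta] for x ~ N(0, 1), via the complementary error function."""
    return 0.5 * math.erfc(beta / math.sqrt(2.0))

def lemma1_bound(beta):
    """Upper bound exp(-beta^2 / 2) from Lemma 1."""
    return math.exp(-beta * beta / 2.0)

beta_min = 1.0 / math.sqrt(2.0 * math.pi)    # smallest beta covered by the lemma
beta_proof = math.sqrt(6.0 * math.log(240))  # value used in the proof for N = 240

checks = [(b, gaussian_tail(b), lemma1_bound(b))
          for b in (beta_min, 0.5, 1.0, 2.0, 4.0, beta_proof)]
```

With $\beta = \sqrt{6 \log N}$ the bound evaluates to $N^{-3}$, which is what makes the union bounds over $O(N)$ events in the proof yield terms of order $N^{-2}$.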


Finally, to upper-bound the terms inside the sum in (1551) . we used |16| Eq. (51)] 




> 2a 


< e 2 


(37)


A.3 Proof of Theorem 1

The proof of Theorem 1 is obtained from the proof of Theorem 4 by noting that in the noise-free
case (i.e., $\sigma = 0$), the perturbation caused by noise vanishes, rendering the second term on
the RHS of (19) void. Finally, we remark that the assumption $6 \log N < m$ is not needed in the
noise-free case as it is involved only in establishing (24), which is void here.


B Proof of Theorem 2

We first note that the data points in $\mathcal{X}_\ell$ can be written as $x_j^{(\ell)} = V^{(\ell)} a_j^{(\ell)}$, $j \in [n_\ell]$, where the
$a_j^{(\ell)}$ are i.i.d. uniform on $\mathbb{S}^{d_\ell - 1}$, and $V^{(\ell)} := \Phi U^{(\ell)}$ is a basis for the $d_\ell$-dimensional subspace of
$\mathbb{R}^p$ containing the points in $\mathcal{X}_\ell$ ($V^{(\ell)}$ has full column rank with high probability, which follows
from (44) as a consequence of the concentration inequality (6)). For the case where the $V^{(\ell)}$ are
orthonormal bases a sufficient condition for successful clustering was derived by Soltanolkotabi and
Candès [28, Thm. 2.8]. However, owing to the projection the $V^{(\ell)} = \Phi U^{(\ell)}$ will in general not
be orthonormal. We will therefore need the following generalization of [28, Thm. 2.8] to arbitrary
bases for $d_\ell$-dimensional subspaces of $\mathbb{R}^p$.


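Theorem 5. Suppose that the elements of the sets $\mathcal{X}_\ell$ in $\mathcal{X} = \mathcal{X}_1 \cup \dots \cup \mathcal{X}_L$ are obtained by choosing
$n_\ell$ points at random according to $x_j^{(\ell)} = V^{(\ell)} a_j^{(\ell)}$, $j \in [n_\ell]$, where the $V^{(\ell)} \in \mathbb{R}^{p \times d_\ell}$ have full rank,
and the $a_j^{(\ell)}$ are i.i.d. uniform on $\mathbb{S}^{d_\ell - 1}$. Assume that $\rho_\ell = (n_\ell - 1)/d_\ell > \rho_0$, for all $\ell$, where $\rho_0 > 1$
is a numerical constant, and let $\rho_{\min} = \min_\ell \rho_\ell$. If

\[ \max_{k,\ell \colon k \neq \ell} \frac{1}{\sqrt{d_k}} \big\| V^{(\ell)\dagger} V^{(k)} \big\|_F < \frac{\sqrt{\log \rho_{\min}}}{64 \log N} \tag{38} \]

where $V^{(\ell)\dagger} = (V^{(\ell)T} V^{(\ell)})^{-1} V^{(\ell)T}$ is the pseudo-inverse of $V^{(\ell)}$, then the graph $G$ with adjacency
matrix obtained by applying SSC to $\mathcal{X}$ has no false connections with probability at least $1 - N^{-1} -$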

Proof. See Appendix B.1. □

We now detail how Theorem 2 follows from Theorem 5. Specifically, we will show that (8)
implies (38) with probability at least $1 - 4e^{-\tau/2}$, which, when combined with the probability bound
in Theorem 5 via a union bound yields the final probability estimate in Theorem 2 and thereby
concludes the proof.

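We start filling in the details by showing how (8) implies (38). The LHS of (38) can be
upper-bounded as follows

\begin{align}
\frac{1}{\sqrt{d_k}} \big\| V^{(\ell)\dagger} V^{(k)} \big\|_F
&= \frac{1}{\sqrt{d_k}} \big\| (V^{(\ell)T} V^{(\ell)})^{-1} V^{(\ell)T} V^{(k)} \big\|_F \nonumber \\
&\le \big\| (V^{(\ell)T} V^{(\ell)})^{-1} \big\|_{2 \to 2} \frac{1}{\sqrt{d_k}} \big\| V^{(\ell)T} V^{(k)} \big\|_F \tag{39} \\
&\le \big\| (V^{(\ell)T} V^{(\ell)})^{-1} \big\|_{2 \to 2} \frac{1}{\sqrt{d_k}} \Big( \big\| U^{(\ell)T} U^{(k)} \big\|_F + \big\| U^{(\ell)T} (\Phi^T \Phi - I) U^{(k)} \big\|_F \Big) \nonumber \\
&\le \big\| (V^{(\ell)T} V^{(\ell)})^{-1} \big\|_{2 \to 2} \Big( \mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) + \big\| U^{(\ell)T} (\Phi^T \Phi - I) U^{(k)} \big\|_{2 \to 2} \Big) \tag{40} \\
&\le \frac{1}{1 - \delta} \big( \mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) + \delta \big) \tag{41} \\
&\le \frac{65}{64} \big( \mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) + \delta \big) \le \frac{\sqrt{\log \rho_{\min}}}{64 \log N} \tag{42}
\end{align}

where (39) follows from $\|AB\|_F \le \|A\|_{2 \to 2} \|B\|_F$, (40) is a consequence of $\|B\|_F \le \sqrt{\min(m, n)}\, \|B\|_{2 \to 2}$,
for $B \in \mathbb{R}^{m \times n}$ [18, Sec. 5.6, p. 365], and (41) holds with

\[ \delta := \sqrt{\frac{28 d_{\max} + 8 \log L + 2\tau}{3 \tilde{c} p}} \tag{43} \]

with probability at least $1 - 4e^{-\tau/2}$ (here, $\tau > 0$ is the numerical constant in the statement of
Theorem 2). Eq. (41) holds with probability at least $1 - 4e^{-\tau/2}$ by

\[ \mathrm{P}\!\left[ \max_{\ell} \big\| (V^{(\ell)T} V^{(\ell)})^{-1} \big\|_{2 \to 2} \ge \frac{1}{1 - \delta} \right] \le 2 e^{-\tau/2} \tag{44} \]

and

\[ \mathrm{P}\!\left[ \max_{k, \ell} \big\| U^{(\ell)T} (\Phi^T \Phi - I) U^{(k)} \big\|_{2 \to 2} \ge \delta \right] \le 2 e^{-\tau/2}, \tag{45} \]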

both proven below. Finally, to get (42) we invoked (8) twice, first we used $\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell) \ge 0$ and
$\frac{\sqrt{\log \rho_{\min}}}{\log N} \le 1$ in (8) to conclude that $\delta \le 1/65$, i.e., $\frac{1}{1 - \delta} \le \frac{65}{64}$, and second, we
applied (8) straight to upper-bound $\mathrm{aff}(\mathcal{S}_k, \mathcal{S}_\ell)$.

It remains to prove (44) and (45). For the special case of a Gaussian random matrix $\Phi$, the
probability bounds (44) and (45) can be obtained using standard results on the extremal singular
values of Gaussian random matrices. For general $\Phi$ satisfying the concentration inequality (6), the
proofs of (44) and (45) rely on Theorem 6 below.

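Theorem 6 ([10, Thm. 9.9, Rem. 9.10]). Suppose that the random matrix $\Phi \in \mathbb{R}^{p \times m}$ satisfies the
concentration inequality (6), i.e.,

\[ \mathrm{P}\Big[ \big| \|\Phi x\|_2^2 - \|x\|_2^2 \big| \ge t \|x\|_2^2 \Big] \le 2 e^{-\tilde{c} t^2 p} \]

for all $t > 0$ and for all $x \in \mathbb{R}^m$, where $\tilde{c}$ is a constant. Then, for an orthonormal matrix $U \in \mathbb{R}^{m \times d}$
and all $t > 0$, we have

\[ \mathrm{P}\!\left[ \big\| U^T \Phi^T \Phi U - I \big\|_{2 \to 2} \ge \sqrt{\frac{14 d + 2 t^2}{3 \tilde{c} p}} \right] \le 2 e^{-t^2/2}. \]

Additionally, for all $t > 0$, we have

\[ \mathrm{P}\!\left[ \big\| \Phi^T \Phi - I \big\|_{2 \to 2} \ge \sqrt{\frac{14 m + 2 t^2}{3 \tilde{c} p}} \right] \le 2 e^{-t^2/2}. \]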


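Proof of (44): By a union bound argument, we get

\[ \mathrm{P}\!\left[ \max_{\ell} \big\| (V^{(\ell)T} V^{(\ell)})^{-1} \big\|_{2 \to 2} \ge \frac{1}{1 - \delta} \right] \le \sum_{\ell = 1}^{L} \mathrm{P}\!\left[ \big\| (V^{(\ell)T} V^{(\ell)})^{-1} \big\|_{2 \to 2} \ge \frac{1}{1 - \delta} \right]. \tag{46} \]

Note that $\|V^{(\ell)T} V^{(\ell)} - I\|_{2 \to 2} \le \delta$ implies that $\sigma_{\min}(V^{(\ell)T} V^{(\ell)}) \ge 1 - \delta$, which in turn implies
$\|(V^{(\ell)T} V^{(\ell)})^{-1}\|_{2 \to 2} \le \frac{1}{1 - \delta}$. We can therefore conclude that $\|(V^{(\ell)T} V^{(\ell)})^{-1}\|_{2 \to 2} \ge \frac{1}{1 - \delta}$ implies
$\|V^{(\ell)T} V^{(\ell)} - I\|_{2 \to 2} \ge \delta$, which can be formalized according to

\[ \Big\{ \big\| (V^{(\ell)T} V^{(\ell)})^{-1} \big\|_{2 \to 2} \ge \frac{1}{1 - \delta} \Big\} \subseteq \Big\{ \big\| V^{(\ell)T} V^{(\ell)} - I \big\|_{2 \to 2} \ge \delta \Big\}. \]

Moreover, we have with $\delta$ as defined in (43)

\[ \delta = \sqrt{\frac{28 d_{\max} + 8 \log L + 2\tau}{3 \tilde{c} p}} \ge \sqrt{\frac{14 d_\ell + 2 t^2}{3 \tilde{c} p}} \qquad (2 d_{\max} \ge d_{\max} \ge d_\ell), \]

with $t^2 = 4 \log L + \tau$. Therefore, Theorem 6 (with $U = U^{(\ell)}$ and $t^2 = 4 \log L + \tau$) yields

\[ \mathrm{P}\!\left[ \big\| (V^{(\ell)T} V^{(\ell)})^{-1} \big\|_{2 \to 2} \ge \frac{1}{1 - \delta} \right] \le 2 e^{-2 \log L - \tau/2} = 2 L^{-2} e^{-\tau/2} \le 2 L^{-1} e^{-\tau/2}, \]

which when used on the RHS of (46) establishes (44).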

Proof of ([45]) : Again, by a union bound argument, we get 

L 


< 2 e _21ogi_T / 2 = 2 L~ 2 e~ T/2 < 2 L" 1 e" r/2 , 


max 

kJt 


trM ($ T $-I)U (fc) 


> 5 


2—>2 


< p uW - l ) xj(fc) 


k,e =i 


> 5 


2—>2 


(47) 


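We next upper-bound the probabilities on the RHS of (47). To this end, let $U \in \mathbb{R}^{m \times d}$ be an orthonormal basis for the $d$-dimensional span of $[U^{(\ell)}\ U^{(k)}]$ ($\max(d_\ell, d_k) \le d \le d_\ell + d_k$). Since $UU^T$ is the orthogonal projection onto $\mathrm{span}([U^{(\ell)}\ U^{(k)}])$, we have $UU^TU^{(\ell)} = U^{(\ell)}$ and $UU^TU^{(k)} = U^{(k)}$. Therefore, we get

\begin{align*}
\big\|U^{(\ell)T}(\Phi^T\Phi - I)U^{(k)}\big\|_{2\to 2}
&= \big\|U^{(\ell)T}UU^T(\Phi^T\Phi - I)UU^TU^{(k)}\big\|_{2\to 2} \\
&\le \big\|U^{(\ell)T}U\big\|_{2\to 2}\, \big\|U^T\Phi^T\Phi U - I\big\|_{2\to 2}\, \big\|U^TU^{(k)}\big\|_{2\to 2} \\
&= \big\|U^T\Phi^T\Phi U - I\big\|_{2\to 2},
\end{align*}

where we used $\|U^{(\ell)T}U\|_{2\to 2} = 1$, which holds since $U^{(\ell)}$ is in the span of $U$ and both $U^{(\ell)}$ and $U$ are orthonormal (the same argument applies to $U^{(k)}$). This finally yields, with $\delta$ as defined in (43),

\begin{align}
\mathrm{P}\Big[\big\|U^{(\ell)T}(\Phi^T\Phi - I)U^{(k)}\big\|_{2\to 2} > \delta\Big]
&\le \mathrm{P}\Bigg[\big\|U^T\Phi^T\Phi U - I\big\|_{2\to 2} > \sqrt{\frac{28\, d_{\max} + 8\log L + 2\tau}{3 c p}}\Bigg] \notag \\
&\le \mathrm{P}\Bigg[\big\|U^T\Phi^T\Phi U - I\big\|_{2\to 2} > \sqrt{\frac{14 d + 8\log L + 2\tau}{3 c p}}\Bigg] \tag{48} \\
&\le 2e^{-\frac{4\log L + \tau}{2}} = 2L^{-2}e^{-\tau/2}, \tag{49}
\end{align}

where (48) follows from $2 d_{\max} \ge d_\ell + d_k \ge d$, and (49) is by application of Theorem 6 with $U$ as defined above and $t^2 = 4\log L + \tau$. The proof is concluded by using (49) on the RHS of (47).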
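The deterministic step in the proof of (45), namely that compressing $\Phi^T\Phi - I$ onto the joint span can only increase the norm, can be checked numerically. The sketch below does so for two one-dimensional subspaces (unit vectors `u` and `v`), where $U$ has two columns and the spectral norm of the $2 \times 2$ matrix $U^T(\Phi^T\Phi - I)U$ is available in closed form. All names, dimensions, and the Gaussian choice of $\Phi$ are illustrative assumptions.

```python
import math
import random


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def normalize(x):
    n = math.sqrt(dot(x, x))
    return [a / n for a in x]


def matvec(M, x):
    return [dot(row, x) for row in M]


def check(m=6, p=4, seed=1):
    """Return (|u^T M v|, ||U^T M U||_{2->2}) for M = Phi^T Phi - I."""
    rng = random.Random(seed)
    phi = [[rng.gauss(0.0, 1.0 / math.sqrt(p)) for _ in range(m)] for _ in range(p)]
    # M = Phi^T Phi - I, a symmetric m x m matrix.
    M = [[sum(phi[r][i] * phi[r][j] for r in range(p)) - (1.0 if i == j else 0.0)
          for j in range(m)] for i in range(m)]
    # Two one-dimensional subspaces spanned by the unit vectors u and v.
    u = normalize([rng.gauss(0.0, 1.0) for _ in range(m)])
    v = normalize([rng.gauss(0.0, 1.0) for _ in range(m)])
    # Gram-Schmidt: (u, w) is an orthonormal basis U of span([u v]).
    w = normalize([vi - dot(u, v) * ui for ui, vi in zip(u, v)])
    lhs = abs(dot(u, matvec(M, v)))
    # Entries of the symmetric 2 x 2 matrix U^T M U = U^T Phi^T Phi U - I.
    a, b, c = dot(u, matvec(M, u)), dot(u, matvec(M, w)), dot(w, matvec(M, w))
    mean, radius = 0.5 * (a + c), math.hypot(0.5 * (a - c), b)
    rhs = max(abs(mean + radius), abs(mean - radius))
    return lhs, rhs


lhs, rhs = check()
print(lhs <= rhs + 1e-12)
```

The inequality holds for every draw, since it is a deterministic consequence of $u$ and $v$ lying in the span of $U$; the randomness only serves to generate test instances.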


B.1 Proof of Theorem 5

Theorem 5 is a generalization of a result by Soltanolkotabi and Candes [28, Thm. 2.8] from orthonormal bases for $d_\ell$-dimensional subspaces of $\mathbb{R}^p$ to arbitrary bases $V^{(\ell)}$ for $d_\ell$-dimensional subspaces. The proof program essentially follows that of [28, Thm. 2.8]. However, some parts of the generalization are non-trivial. We only detail the arguments that are new relative to [28], and refer to [28] otherwise.

Throughout the proof, we use the following notation: Let $X^{(\ell)} \in \mathbb{R}^{p \times n_\ell}$ be the matrix whose columns are the points in $\mathcal{X}_\ell$, and note that $X^{(\ell)} = V^{(\ell)}A^{(\ell)}$, where $A^{(\ell)} \in \mathbb{R}^{d_\ell \times n_\ell}$ is the matrix with columns $a_i^{(\ell)}$, $i = 1, \dots, n_\ell$. Set $X = [X^{(1)} \dots X^{(L)}] \in \mathbb{R}^{p \times N}$, and let $X_{-i}$ be the matrix obtained by removing the $i$th column $x_i$ from $X$. $\mathcal{P}(X)$ denotes the symmetrized convex hull of the columns of $X$ (i.e., the points in $\mathcal{X}$), that is, the convex hull of $\{x_1, -x_1, \dots, x_N, -x_N\}$. For a convex body $\mathcal{P}$, its inradius $r(\mathcal{P})$ is defined as the radius of the largest Euclidean ball that can be inscribed in $\mathcal{P}$, and its circumradius $R(\mathcal{P})$ is defined as the radius of the smallest ball containing $\mathcal{P}$. Finally, the polar set of $\mathcal{K} \subset \mathbb{R}^n$ is defined as

\[
\mathcal{K}^\circ = \{y \in \mathbb{R}^n \colon \langle x, y\rangle \le 1 \text{ for all } x \in \mathcal{K}\}.
\]

B.1.1 A deterministic clustering condition 

We first establish a deterministic clustering condition. Specifically, in Theorem 7 below we present conditions guaranteeing that for $x_i \in \mathcal{X}_\ell$ every solution of the problem

\[
\underset{z}{\text{minimize}}\ \|z\|_1 \quad \text{subject to} \quad X_{-i}z = x_i \tag{50}
\]

has non-zero entries corresponding to columns of $X^{(\ell)}$ only. The proof of Theorem 5 is then obtained by proving that these conditions are satisfied with high probability for the statistical data model in Theorem 5. We start by introducing terminology needed in the following. Define the primal optimization problem

\[
P(y, A)\colon \quad \underset{z}{\text{minimize}}\ \|z\|_1 \quad \text{subject to} \quad Az = y
\]

with the corresponding dual [4, Sec. 5.1.16]

\[
D(y, A)\colon \quad \underset{\nu}{\text{maximize}}\ \langle y, \nu\rangle \quad \text{subject to} \quad \|A^T\nu\|_\infty \le 1.
\]

The problem (50) is then simply $P(x_i, X_{-i})$. The sets of optimal solutions of $P$ and $D$ are denoted by $\mathrm{optsol}P(y, A)$ and $\mathrm{optsol}D(y, A)$, respectively. A dual point $\lambda(y, A)$ is defined as a point in $\mathrm{optsol}D(y, A)$ of minimal Euclidean norm.

We are now ready to state the following generalization of [28, Thm. 2.5] from orthonormal bases for $d_\ell$-dimensional subspaces of $\mathbb{R}^p$ to arbitrary bases for $d_\ell$-dimensional subspaces.


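Theorem 7. Suppose that the elements of the sets $\mathcal{X}_\ell$ in $\mathcal{X} = \mathcal{X}_1 \cup \dots \cup \mathcal{X}_L$ are obtained by choosing $n_\ell$ points according to $x_j^{(\ell)} = V^{(\ell)}a_j^{(\ell)}$, $j \in [n_\ell]$, where the $a_j^{(\ell)}$ are deterministic coefficient vectors and the $V^{(\ell)} \in \mathbb{R}^{p \times d_\ell}$ are deterministic matrices of full column rank. Let $\Lambda^{(\ell)} \in \mathbb{R}^{d_\ell \times n_\ell}$ be the matrix whose columns are the normalized dual points $\bar\lambda(a_i^{(\ell)}, A_{-i}^{(\ell)}) = \lambda(a_i^{(\ell)}, A_{-i}^{(\ell)})/\|\lambda(a_i^{(\ell)}, A_{-i}^{(\ell)})\|_2$, where $A_{-i}^{(\ell)}$ is the matrix with columns $a_j^{(\ell)}$, $j \in [n_\ell] \setminus \{i\}$. If

\[
\max_{k \neq \ell,\, j}\ \big\|\Lambda^{(\ell)T}\, V^{(\ell)\dagger}\, V^{(k)} a_j^{(k)}\big\|_\infty < r\big(\mathcal{P}(A_{-i}^{(\ell)})\big), \tag{51}
\]

then the non-zero entries of all solutions of $P(x_i^{(\ell)}, X_{-i})$ correspond to points in $\mathcal{X}_\ell$ only (the columns of $X_{-i}$ are the elements in $\mathcal{X} \setminus \{x_i^{(\ell)}\}$).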

Proof. The proof relies on the following lemma.

Lemma 2 ([28, Lem. 7.1], [6]). Let $T$ be a subset of the column indices of a given matrix $A$. All solutions $c^\star$ of $P(y, A)$ satisfy $c^\star_{T^c} = 0$, if there exists a vector $c$ such that $y = Ac$ with support $S \subseteq T$, and a (dual certificate) vector $\nu$ obeying

\begin{align}
A_S^T\nu &= \mathrm{sign}(c_S) \tag{52} \\
\|A_{T \cap S^c}^T\nu\|_\infty &\le 1 \tag{53} \\
\|A_{T^c}^T\nu\|_\infty &< 1. \tag{54}
\end{align}

We apply Lemma 2 with $A = X_{-i}$, $y = x_i^{(\ell)}$, and $T$ the index set corresponding to the columns of $X_{-i}^{(\ell)}$, and show that there exists a vector $c$ supported on $S \subseteq T$ that obeys $x_i^{(\ell)} = X_{-i}c$, and a corresponding vector $\nu$ that satisfies (52)-(54). This then implies that the non-zero entries of all solutions of $P(x_i^{(\ell)}, X_{-i})$ correspond to points in $\mathcal{X}_\ell$ only, as desired.

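We proceed with the explicit construction of the vector $c$. Specifically, take $c$ to be a vector that is zero on $T^c$, and whose restriction to the index set $T$ is given by $c_T \in \mathrm{optsol}P(x_i^{(\ell)}, X_{-i}^{(\ell)})$. Let $S$ be the support of $c_T$, and let $\nu_i^{(\ell)} = (V^{(\ell)T})^\dagger \lambda_i^{(\ell)}$, where $\lambda_i^{(\ell)}$ is taken to be a point of minimum $\ell_2$-norm$^3$ in $\mathrm{optsol}D(a_i^{(\ell)}, A_{-i}^{(\ell)})$. The next step is to show that $\nu_i^{(\ell)} \in \mathrm{optsol}D(x_i^{(\ell)}, X_{-i}^{(\ell)})$, which will eventually allow us to establish that $\nu_i^{(\ell)}$ satisfies the conditions of Lemma 2. To this end, we first note that $X_{-i}^{(\ell)} = V^{(\ell)}A_{-i}^{(\ell)}$ yields

\begin{align*}
\mathrm{optsol}D(x_i^{(\ell)}, X_{-i}^{(\ell)})
&= \Big\{\arg\max_{\nu}\ \langle a_i^{(\ell)}, V^{(\ell)T}\nu\rangle \ \text{subject to}\ \big\|(A_{-i}^{(\ell)})^T V^{(\ell)T}\nu\big\|_\infty \le 1\Big\} \\
&= \big\{\nu \colon \lambda = V^{(\ell)T}\nu,\ \lambda \in \mathrm{optsol}D(a_i^{(\ell)}, A_{-i}^{(\ell)})\big\} \\
&\supseteq (V^{(\ell)T})^\dagger\, \mathrm{optsol}D(a_i^{(\ell)}, A_{-i}^{(\ell)}),
\end{align*}

where the inclusion holds as $(V^{(\ell)T})^\dagger\lambda$ is the minimum norm solution to the linear system of equations $\lambda = V^{(\ell)T}\nu$, but in general not the only solution.

3 For concreteness, $\lambda_i^{(\ell)}$ is taken to be a point of minimum $\ell_2$-norm. Note, however, that for the proof to work we may let $\lambda_i^{(\ell)}$ be an arbitrary point in $\mathrm{optsol}D(a_i^{(\ell)}, A_{-i}^{(\ell)})$.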

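Since $P(y, A)$ is a linear program, strong duality [4, Sec. 5.2.3] holds (provided that $P(y, A)$ is feasible) and therefore the optimal objective values of $P(x_i^{(\ell)}, X_{-i}^{(\ell)})$ and $D(x_i^{(\ell)}, X_{-i}^{(\ell)})$ coincide. It therefore follows that

\[
\|c_T\|_1 = \langle x_i^{(\ell)}, \nu_i^{(\ell)}\rangle. \tag{55}
\]

Since $c_T \in \mathrm{optsol}P(x_i^{(\ell)}, X_{-i}^{(\ell)})$ and $c_T$ is supported on $S$, both by assumption, we have $x_i^{(\ell)} = X_{-i}^{(\ell)}c_T = (X_{-i}^{(\ell)})_S c_S$, and therefore (55) becomes

\[
\langle c_S, \mathrm{sign}(c_S)\rangle = \big\langle (X_{-i}^{(\ell)})_S c_S, \nu_i^{(\ell)}\big\rangle = \big\langle c_S, (X_{-i}^{(\ell)})_S^T \nu_i^{(\ell)}\big\rangle. \tag{56}
\]

On the other hand, as $\nu_i^{(\ell)} \in \mathrm{optsol}D(x_i^{(\ell)}, X_{-i}^{(\ell)})$, we have $\|(X_{-i}^{(\ell)})^T\nu_i^{(\ell)}\|_\infty \le 1$, which is equivalent to the following conditions (recall that the set $T$ corresponds to the column indices of $X_{-i}^{(\ell)}$):

\begin{align}
\big\|(X_{-i}^{(\ell)})_S^T\, \nu_i^{(\ell)}\big\|_\infty &\le 1 \tag{57} \\
\big\|(X_{-i}^{(\ell)})_{T \cap S^c}^T\, \nu_i^{(\ell)}\big\|_\infty &\le 1. \tag{58}
\end{align}

As by (57), the entries of $(X_{-i}^{(\ell)})_S^T \nu_i^{(\ell)}$ are bounded in magnitude by 1 and the unique maximizer of $\max_{a \colon |a_i| \le 1} \langle c_S, a\rangle$ is $\mathrm{sign}(c_S)$, it follows from (56) that

\[
(X_{-i}^{(\ell)})_S^T\, \nu_i^{(\ell)} = \mathrm{sign}(c_S),
\]

which establishes (52). Thanks to (58), (53) is satisfied as well. It remains to verify (54), which here reads

\[
\big|\langle x_j^{(k)}, \nu_i^{(\ell)}\rangle\big| < 1, \quad \text{for all } k \neq \ell, \text{ for all } j \in [n_k]. \tag{59}
\]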


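With $\nu_i^{(\ell)} = (V^{(\ell)T})^\dagger\lambda_i^{(\ell)}$, by definition, (59) becomes

\[
\bigg|\bigg\langle x_j^{(k)},\, (V^{(\ell)T})^\dagger \frac{\lambda_i^{(\ell)}}{\|\lambda_i^{(\ell)}\|_2}\bigg\rangle\bigg| < \frac{1}{\|\lambda_i^{(\ell)}\|_2}, \quad \text{for all } k \neq \ell, \text{ for all } j \in [n_k]. \tag{60}
\]

Since $((V^{(\ell)T})^\dagger)^T = V^{(\ell)\dagger}$ and $x_j^{(k)} = V^{(k)}a_j^{(k)}$, (60) is equivalent to

\[
\big|\bar\lambda_i^{(\ell)T}\, V^{(\ell)\dagger}\, V^{(k)} a_j^{(k)}\big| < \frac{1}{\|\lambda_i^{(\ell)}\|_2}, \quad \text{for all } k \neq \ell, \text{ for all } j \in [n_k], \tag{61}
\]

where $\bar\lambda_i^{(\ell)} = \lambda_i^{(\ell)}/\|\lambda_i^{(\ell)}\|_2$ denotes the normalized dual point. It now follows from $\lambda_i^{(\ell)} \in \mathrm{optsol}D(a_i^{(\ell)}, A_{-i}^{(\ell)})$, which holds by assumption, that

\[
\big\|(A_{-i}^{(\ell)})^T\lambda_i^{(\ell)}\big\|_\infty \le 1.
\]

This, in turn, implies that $\lambda_i^{(\ell)} \in \mathcal{P}^\circ(A_{-i}^{(\ell)})$, where

\[
\mathcal{P}^\circ(A_{-i}^{(\ell)}) = \big\{z \colon \|(A_{-i}^{(\ell)})^T z\|_\infty \le 1\big\}
\]

is the polar set of $\mathcal{P}(A_{-i}^{(\ell)})$ (recall that $\mathcal{P}(A_{-i}^{(\ell)})$ is the symmetrized convex hull of the columns in $A_{-i}^{(\ell)}$). Since the inradius and the circumradius of a symmetric$^4$ convex body are related according to [13, Thm. 1.2]

\[
r(\mathcal{P})\, R(\mathcal{P}^\circ) = 1,
\]

we get from $\lambda_i^{(\ell)} \in \mathcal{P}^\circ(A_{-i}^{(\ell)})$ that

\[
\|\lambda_i^{(\ell)}\|_2 \le R\big(\mathcal{P}^\circ(A_{-i}^{(\ell)})\big) = \frac{1}{r\big(\mathcal{P}(A_{-i}^{(\ell)})\big)}. \tag{62}
\]

By (62), it follows that (61) holds if

\[
\big|\bar\lambda_i^{(\ell)T}\, V^{(\ell)\dagger}\, V^{(k)} a_j^{(k)}\big| < r\big(\mathcal{P}(A_{-i}^{(\ell)})\big), \quad \text{for all } k \neq \ell, \text{ for all } j \in [n_k],
\]

which is implied by (51). This proves that (54) is satisfied as well, thereby concluding the proof. □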

B.1.2 Evaluating the deterministic clustering condition for the statistical data model 

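Theorem 5 now follows from Theorem 7 by establishing that, for our statistical data model, the deterministic clustering condition (51) holds for all pairs $(\ell, i)$ with $\ell \in [L]$, $i \in [n_\ell]$, with high probability. Specifically, by a union bound argument, we get

\begin{align}
&\mathrm{P}\big[\text{(51) is violated for at least one pair } (\ell, i)\big] \notag \\
&\quad\le \sum_{(\ell,i)} \mathrm{P}\Big[\max_{k \neq \ell,\, j} \big\|\Lambda^{(\ell)T} V^{(\ell)\dagger} V^{(k)} a_j^{(k)}\big\|_\infty \ge r\big(\mathcal{P}(A_{-i}^{(\ell)})\big)\Big] \notag \\
&\quad\le \sum_{(\ell,i)} \Bigg(\mathrm{P}\Big[\max_{k \neq \ell,\, j} \big\|\Lambda^{(\ell)T} V^{(\ell)\dagger} V^{(k)} a_j^{(k)}\big\|_\infty \ge \frac{16 \log N}{\sqrt{d_\ell d_k}} \big\|V^{(\ell)\dagger} V^{(k)}\big\|_F\Big] + \mathrm{P}\Big[\frac{\sqrt{\log \rho_\ell}}{4\sqrt{d_\ell}} \ge r\big(\mathcal{P}(A_{-i}^{(\ell)})\big)\Big]\Bigg) \tag{63} \\
&\quad\le \frac{1}{N} + \sum_{\ell=1}^{L} n_\ell\, e^{-\sqrt{\rho_\ell}\, d_\ell}. \tag{64}
\end{align}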

4 A convex body $\mathcal{P}$ is called symmetric if $x \in \mathcal{P}$ if and only if $-x \in \mathcal{P}$.

In (63) we used that for random variables $X$ and $Y$, possibly dependent, and constants $\phi$ and $\varphi$ satisfying $\phi \le \varphi$, we have

\begin{align}
\mathrm{P}[X \ge Y] &\le \mathrm{P}\big[\{X \ge \phi\} \cup \{\varphi \ge Y\}\big] \notag \\
&\le \mathrm{P}[X \ge \phi] + \mathrm{P}[\varphi \ge Y]. \tag{65}
\end{align}
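The splitting step (65) holds sample by sample: whenever $X \ge Y$, at least one of $X \ge \phi$ or $\varphi \ge Y$ must occur, since otherwise $Y \le X < \phi \le \varphi < Y$. The following sketch checks this pointwise on dependent toy samples; the distributions, thresholds, and the function name `check_split` are illustrative assumptions and have nothing to do with the specific quantities in (63).

```python
import random


def check_split(phi, varphi, trials=10000, seed=0):
    """Return empirical estimates of P[X >= Y], P[X >= phi], P[varphi >= Y].

    X and Y are made dependent on purpose; phi <= varphi is required.
    """
    assert phi <= varphi
    rng = random.Random(seed)
    n_xy = n_x = n_y = 0
    for _ in range(trials):
        x = rng.gauss(0.0, 1.0)
        y = 0.5 * x + rng.gauss(1.0, 1.0)  # Y depends on X
        # Pointwise version of (65): X >= Y forces X >= phi or varphi >= Y.
        assert not (x >= y) or (x >= phi) or (varphi >= y)
        n_xy += x >= y
        n_x += x >= phi
        n_y += varphi >= y
    return n_xy / trials, n_x / trials, n_y / trials


p_xy, p_x, p_y = check_split(phi=0.3, varphi=0.7)
print(p_xy <= p_x + p_y)
```

Because the inequality between indicators holds for every sample, the empirical frequencies satisfy the same bound regardless of the seed.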


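Specifically, we applied (65) with $\phi = \frac{16 \log N}{\sqrt{d_\ell d_k}} \|V^{(\ell)\dagger} V^{(k)}\|_F$ and $\varphi = \frac{\sqrt{\log \rho_\ell}}{4\sqrt{d_\ell}}$, which leads to the assumption

\[
\frac{16 \log N}{\sqrt{d_\ell d_k}} \big\|V^{(\ell)\dagger} V^{(k)}\big\|_F \le \frac{\sqrt{\log \rho_\ell}}{4\sqrt{d_\ell}}, \quad \text{for all } k, \ell \colon k \neq \ell,
\]

implied by (38). To get (64) we used that, for all $i$,

\[
\mathrm{P}\Big[\frac{\sqrt{\log \rho_\ell}}{4\sqrt{d_\ell}} \ge r\big(\mathcal{P}(A_{-i}^{(\ell)})\big)\Big] \le e^{-\sqrt{\rho_\ell}\, d_\ell} \tag{66}
\]

and

\[
\mathrm{P}\Big[\max_{k \neq \ell,\, j} \big\|\Lambda^{(\ell)T} V^{(\ell)\dagger} V^{(k)} a_j^{(k)}\big\|_\infty \ge \frac{16 \log N}{\sqrt{d_\ell d_k}} \big\|V^{(\ell)\dagger} V^{(k)}\big\|_F\Big] \le N^{-2}, \tag{67}
\]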


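both of which are established next.

The upper bound (66) is an application of [28, Lem. 7.4], [2], and makes use of the assumption $(n_\ell - 1)/d_\ell = \rho_\ell \ge \rho_0 > 1$. Finally, (67) follows from a union bound argument and

\[
\mathrm{P}\Big[\big\|\Lambda^{(\ell)T} V^{(\ell)\dagger} V^{(k)} a_j^{(k)}\big\|_\infty \ge \frac{16 \log N}{\sqrt{d_\ell d_k}} \big\|V^{(\ell)\dagger} V^{(k)}\big\|_F\Big] \le (n_\ell + 1)\, e^{-4 \log N} \le N^{-3}, \tag{68}
\]

which is a consequence of Lemma 3 below together with the fact that the normalized dual point $\bar\lambda_i^{(\ell)} = \lambda_i^{(\ell)}/\|\lambda_i^{(\ell)}\|_2$ is distributed uniformly on the unit sphere, as shown in [28, Sec. 7.2.2, Proof of Step 2].

Lemma 3 (Extracted from the proof of Lemma 7.5 in [28]). Let the columns of $\Lambda \in \mathbb{R}^{d_1 \times n_1}$ be i.i.d. uniform on $\mathbb{S}^{d_1 - 1}$, let $a$ be uniform on $\mathbb{S}^{d_2 - 1}$, and let $B \in \mathbb{R}^{d_1 \times d_2}$. Then, for $c > 12$, we have

\[
\mathrm{P}\Big[\big\|\Lambda^T B a\big\|_\infty \ge \frac{c}{\sqrt{d_1 d_2}} \|B\|_F\Big] \le (n_1 + 1)\, e^{-c/4}.
\]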


C Proof of Theorem 3

The graph $G$ obtained by SSC-OMP has no false connections if for each $x_i^{(\ell)} \in \mathcal{X}_\ell$ the OMP algorithm as detailed in Section 2 selects points from $\mathcal{X}_\ell$ only, for all $\ell \in [L]$. This is the case if OMP selects points from $\mathcal{X}_\ell$ in all iterations $s \in [s_{\mathrm{end}}]$ (we explain below that OMP terminates after $s_{\mathrm{end}} = s_{\max} \wedge d_\ell$ iterations with high probability for our statistical data model). The OMP selection rule (2) implies that OMP selects a point from $\mathcal{X}_\ell$ in the $(s+1)$th iteration if

\[
\max_{k \neq \ell,\, j \in [n_k]} \big|\langle x_j^{(k)}, r_s\rangle\big| < \max_{j \in [n_\ell] \colon j \neq i} \big|\langle x_j^{(\ell)}, r_s\rangle\big|. \tag{69}
\]

Hence, the graph $G$ obtained by SSC-OMP has no false connections if the deterministic clustering condition (69) holds for all $s_{\mathrm{end}}$ OMP iterations, for all $x_i^{(\ell)} \in \mathcal{X}_\ell$, $\ell \in [L]$. We will next establish that (69) is satisfied for our statistical data model with probability obeying the bound in Theorem 3.
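To make the selection condition (69) concrete, the following sketch implements a generic OMP selection loop and runs it on a toy data set with two orthogonal subspaces. It is a minimal illustration under stated assumptions, not necessarily identical to the OMP variant of Section 2: the function name `omp`, the tolerance, and the toy points are all hypothetical, and the residual is updated by Gram-Schmidt orthogonalization, which gives the same residual as a least-squares fit on the selected points.

```python
import math


def dot(x, y):
    return sum(a * b for a, b in zip(x, y))


def omp(x, dictionary, s_max, tol=1e-12):
    """Greedy OMP selection: return the indices of the selected dictionary points."""
    residual = list(x)
    selected, basis = [], []
    for _ in range(s_max):
        if math.sqrt(dot(residual, residual)) <= tol:
            break  # x is represented exactly; nothing is left to select
        # Selection rule: pick the point most correlated with the residual.
        scores = [abs(dot(d, residual)) if j not in selected else -1.0
                  for j, d in enumerate(dictionary)]
        j_best = max(range(len(dictionary)), key=scores.__getitem__)
        selected.append(j_best)
        # Orthonormalize the new point against the previously selected ones.
        q = list(dictionary[j_best])
        for b in basis:
            coeff = dot(b, q)
            q = [qi - coeff * bi for qi, bi in zip(q, b)]
        norm_q = math.sqrt(dot(q, q))
        if norm_q > tol:
            q = [qi / norm_q for qi in q]
            basis.append(q)
            # Remove the component of the residual along the new direction.
            coeff = dot(q, residual)
            residual = [ri - coeff * qi for ri, qi in zip(residual, q)]
    return selected


# Two orthogonal 2-dimensional subspaces of R^4 (toy data, unit-norm points).
points_1 = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0]]  # same subspace as x
points_2 = [[0.0, 0.0, 1.0, 0.0], [0.0, 0.0, 0.6, 0.8]]  # other subspace
x = [0.6, 0.8, 0.0, 0.0]
selected = omp(x, points_1 + points_2, s_max=3)
print(selected)  # prints [1, 0]: only points from the subspace of x are selected
```

In this toy example the correlations with the points of the other subspace are exactly zero, so (69) holds trivially in every iteration and OMP stops after two iterations, once the residual vanishes.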

As a vehicle for our analysis, we introduce the reduced OMP algorithm which, to compute sparse representations of the $x_i^{(\ell)}$, has access to the corresponding reduced data sets $\mathcal{X}_\ell \setminus \{x_i^{(\ell)}\}$ only, instead of the full data sets $\mathcal{X} \setminus \{x_i^{(\ell)}\}$. If, for a given data set $\mathcal{X}$, the residuals computed by reduced OMP, henceforth denoted by $r_s^{(\ell)}$, satisfy (69) for all iterations, then the reduced OMP algorithm and the original OMP algorithm (processing the same data set $\mathcal{X}$) select exactly the same data points in the same order and we have $r_s = r_s^{(\ell)}$ for all $s \in [s_{\max} \wedge d_\ell]$ by virtue of (3). We emphasize that for expositional convenience the notations $r_s^{(\ell)}$ and $\tilde r_s^{(\ell)}$ do not reflect dependence on $i$. The motivation for working with the reduced OMP algorithm is that $r_s^{(\ell)}$, being a function of the data points in $\mathcal{X}_\ell$ only, conditionally on $\Phi$, is statistically independent of the data points in $\mathcal{X} \setminus \mathcal{X}_\ell$. This will allow us to establish tail bounds for $|\langle x_j^{(k)}, r_s^{(\ell)}\rangle|$, $k \neq \ell$, $j \in [n_k]$, using standard concentration inequalities. We proceed to show that under the assumptions of Theorem 3, the reduced OMP residuals $r_s^{(\ell)}$ indeed satisfy (69) for all $\ell \in [L]$, $i \in [n_\ell]$, and $s \in [s_{\max} \wedge d_\ell]$ with probability meeting the lower bound in Theorem 3.

Consider the reduced OMP algorithm for the data point $x_i^{(\ell)}$ with fixed $\ell \in [L]$ and fixed $i \in [n_\ell]$. We start by noting that the reduced OMP index set $\Lambda_s$ is a function of the data points in $\mathcal{X}_\ell$ only. After iteration $s$, with $x_i^{(\ell)} = \Phi U^{(\ell)} a_i^{(\ell)}$ and $X_{\Lambda_s} = \Phi U^{(\ell)} A_{\Lambda_s}^{(\ell)}$ inserted into (3), we get $r_s^{(\ell)} = \Phi U^{(\ell)} \tilde r_s^{(\ell)}$, where

\[
\tilde r_s^{(\ell)} := \Big(I - A_{\Lambda_s}^{(\ell)} \big(\Phi U^{(\ell)} A_{\Lambda_s}^{(\ell)}\big)^\dagger \Phi U^{(\ell)}\Big)\, a_i^{(\ell)}.
\]

We next establish a lower bound on the RHS of (69) and an upper bound on the LHS of (69). To isolate the impact of the different random quantities in the statistical data model, we will introduce events, upon the intersection of which (69) is implied by the assumption of Theorem 3 via these bounds. A union bound on the probability of the intersection of these events then yields the final result.

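We start by lower-bounding the RHS of (69) according to

\begin{align}
\max_{j \in [n_\ell] \colon j \neq i} \big|\langle x_j^{(\ell)}, r_s^{(\ell)}\rangle\big|
&\ge \frac{1}{4}\sqrt{\frac{\log \rho_\ell}{d_\ell}}\ \sigma_{\min}\big(U^{(\ell)T}\Phi^T\Phi U^{(\ell)}\big)\, \big\|\tilde r_s^{(\ell)}\big\|_2 \tag{70} \\
&\ge \frac{1}{4}\sqrt{\frac{\log \rho_\ell}{d_\ell}}\ (1 - \delta)\, \big\|\tilde r_s^{(\ell)}\big\|_2, \tag{71}
\end{align}

where (70) and (71) hold on the events

\[
\mathcal{E}_1^{(\ell,i)} := \Big\{\max_{j \in [n_\ell] \colon j \neq i} \big|x_j^{(\ell)T}\Phi U^{(\ell)} v\big| \ge \frac{1}{4}\sqrt{\frac{\log \rho_\ell}{d_\ell}}\ \sigma_{\min}\big(U^{(\ell)T}\Phi^T\Phi U^{(\ell)}\big)\, \|v\|_2,\ \forall v \in \mathbb{R}^{d_\ell}\Big\}
\]

and

\[
\mathcal{E}_2 := \Big\{\min_{\ell}\ \sigma_{\min}\big(U^{(\ell)T}\Phi^T\Phi U^{(\ell)}\big) \ge 1 - \delta\Big\}, \quad \delta \in (0, 1),
\]

respectively. Note that $\tilde r_s^{(\ell)}$ in (70) not being statistically independent of the $x_j^{(\ell)}$, $j \neq i$, is not an issue as we consider (70) on the event $\mathcal{E}_1^{(\ell,i)}$ and the inequality in the definition of $\mathcal{E}_1^{(\ell,i)}$ applies to all $v \in \mathbb{R}^{d_\ell}$. Since $V^{(\ell)} = \Phi U^{(\ell)}$ has full rank on $\mathcal{E}_2$, reduced OMP terminates after $s_{\max} \wedge d_\ell$ iterations. To see this, simply note that for $V^{(\ell)}$ of full rank we need exactly $d_\ell$ points from $\mathcal{X}_\ell \setminus \{x_i^{(\ell)}\}$ to represent $x_i^{(\ell)} = V^{(\ell)} a_i^{(\ell)}$ (owing to the fact that the $a_j^{(\ell)}$, $j \in [n_\ell]$, are i.i.d. uniform on $\mathbb{S}^{d_\ell - 1}$) and thus $r_s^{(\ell)} = 0$ after exactly $d_\ell$ iterations.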

We continue by upper-bounding the LHS of (1691) according to 


max 

k±l,j 



= max 

a (fc) T u (fc) r $ T $u (t) f .(£) 

\ 3 s / 

k^i,j 

J s 


= max 

k^i,j 

< max 

k^l,j 


(fc) T u (fe) T u W f W + a f )T U (fc)T ($ T $ - I)U w f^ 

af )T U ( fc ) T ($ T $ - I)U W fW 


1 ( fc ) T u (fc) T u m f w 


+ max 

k&,j 


U( fc ) T uW 


< 4(3 log N + log s max ) max 

k^i v4v®« 


M 


+ 


6 log N + 2 log 


d\\ 


< 4(3 log N + log s max ) max 


U( fc ) T ($ T $- I ) U W 
U(fc) T uW 

-M 


2—>2 




kj^l \fdk\fdi 


/ 61ogiV + 21og s max ^ 


d u 


-M 




-M 


(72) 


(73) 

(74) 


Here, (1721) holds on the intersection of the events 


£ 


(£,i,s) _ 


max 

k^i,j 


af) T U (fc)T ($ T $ - I)U W f^ 


< 


6 log N + 2 log ^ 


Clr, 


u( fc ) r ($ r $ - I)U W 


2—>2 




£ 




max 


a< fc) U« T uWfW 

J * 


< 4(3 log N + log s max ) max 


U(*) T uW 


k+t Vd~kVd t 


;(<) 


and (1751) holds on the event 


£5 := < max 

1 k,e-. k^l 


uW T ($ t $-I)U 




2—>2 


<sy 


Recall that the notation $\tilde r_s^{(\ell)}$ does not reflect dependence on the index $i$. We do, however, make the dependence of $\mathcal{E}_3^{(\ell,i,s)}$ and $\mathcal{E}_4^{(\ell,i,s)}$ on $i$ explicit.

Finally, setting $\delta := \sqrt{\frac{28\, d_{\max} + 8 \log L + 2\tau}{3 c p}}$, (74) follows from the assumption of Theorem 3. This is seen as follows:


max 

kj^t 


u( fc ) T uW 


y/dk 


+ 


2^/6 log IV + 2 log s n 


< max aff(«5u.,«S>)-I— 
“ k,t. k^e y ’ ' 2 


5 d„ 


< 


3 VlogPmiii 
200 log N 


< 


< 


\/!og Pmin 


50(log N + (log s max )/3) 
_yiog pi _ 

48 (log N + (log s max )/3) 


(1-5), 


(75) 

(76) 

(77) 

(78) 


where (f76l) is by Q and (1771) follows by noting that (200/3) log IV = 50(loglV + (logIV)/3) > 
50(logIV + (logs max )/3). Furthermore, we have 


Vlog Pmin 
log N + (log 

^max )/3 


< 1 


(79) 


as a consequence of p m i n = min^(n^ — 1 )/de < N/d m ; n , IV > 3, and d m\n > 1. Next, (1791) combined 
with \/d m i n /d max < 1, maxfc^ : aff(5fc, 5f) > 0, and (1771) . implies that 5 < which yields 

^ < -j^(l — 5) and hence establishes (1751) . Finally, (1711) is obtained by rewriting the relation 
between the RHS of (1771) and (1751) . 

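Note that the lower bound (71) on the RHS of (69) and the upper bound (74) on the LHS of (69) are equal; we have therefore established that, for fixed $(i, \ell)$, $r_s^{(\ell)}$ obeys (69) on $\mathcal{E}_1^{(\ell,i)} \cap \mathcal{E}_2 \cap \mathcal{E}_3^{(\ell,i,s)} \cap \mathcal{E}_4^{(\ell,i,s)} \cap \mathcal{E}_5$. It finally follows that on the event $\mathcal{E}^\star := \bigcap_{\ell,i,s} \mathcal{E}_1^{(\ell,i)} \cap \mathcal{E}_2 \cap \mathcal{E}_3^{(\ell,i,s)} \cap \mathcal{E}_4^{(\ell,i,s)} \cap \mathcal{E}_5$, the graph $G$ obtained by SSC-OMP, with OMP operating on the full data sets $\mathcal{X} \setminus \{x_i^{(\ell)}\}$, has no false connections. It remains to lower-bound $\mathrm{P}[\mathcal{E}^\star]$. Specifically, we have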


P[£*] = l-P 

> 1 — P [£ 2 ] - P |c 5 


> 1 — 4e r / 2 — ^ n^e 
te[L] 


E (p 

£e[L\,ie[n e ] \ 
rpidu _ 

N’ 


g m 


+ E (■ 

selSmaxAdf] 




where the last inequality follows from 


c l 


< e -y/pe d e 


2 ^ 


< 2e" r/2 


‘-'3 

p(bM)' 

04 


< 


< 


-’max 

2 


N 2 


Sr 


^max 

PEI <2 


+ P 


C 4 


Here, (1851) corresponds to (1771) , while the proofs of (1511) - (1571) are presented below. 


(80) 


(81) 

(82) 

(83) 

(84) 

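(85)

Proof of (81): Since $A_{-i}^{(\ell)}$ has full column rank with probability 1, it follows from Lemma 4 below that

\begin{align*}
\big\|A_{-i}^{(\ell)T} U^{(\ell)T}\Phi^T\Phi U^{(\ell)} v\big\|_\infty
&\ge r\big(\mathcal{P}(A_{-i}^{(\ell)})\big)\, \big\|U^{(\ell)T}\Phi^T\Phi U^{(\ell)} v\big\|_2 \\
&\ge r\big(\mathcal{P}(A_{-i}^{(\ell)})\big)\, \sigma_{\min}\big(U^{(\ell)T}\Phi^T\Phi U^{(\ell)}\big)\, \|v\|_2
\end{align*}

for all $v \in \mathbb{R}^{d_\ell}$. We therefore have

\begin{align}
\mathrm{P}\big[\bar{\mathcal{E}}_1^{(\ell,i)}\big]
&= \mathrm{P}\Big[\exists v \colon \big\|A_{-i}^{(\ell)T} U^{(\ell)T}\Phi^T\Phi U^{(\ell)} v\big\|_\infty < \frac{1}{4}\sqrt{\frac{\log \rho_\ell}{d_\ell}}\ \sigma_{\min}\big(U^{(\ell)T}\Phi^T\Phi U^{(\ell)}\big)\, \|v\|_2\Big] \notag \\
&\le \mathrm{P}\Big[r\big(\mathcal{P}(A_{-i}^{(\ell)})\big) < \frac{1}{4}\sqrt{\frac{\log \rho_\ell}{d_\ell}}\Big] \notag \\
&\le e^{-\sqrt{\rho_\ell}\, d_\ell}, \tag{86}
\end{align}

where (86) follows from (66), which uses the assumption $(n_\ell - 1)/d_\ell = \rho_\ell \ge \rho_0 > 1$.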

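Lemma 4. For a matrix $A \in \mathbb{R}^{m \times n}$ of full column rank and $v \in \mathbb{R}^m$, it holds that

\[
\|A^T v\|_\infty \ge r(\mathcal{P}(A))\, \|v\|_2, \tag{87}
\]

where $r(\mathcal{P}(A))$ is the inradius of the symmetrized convex hull $\mathcal{P}(A)$ of the columns of $A$.

Proof. The inequality (87) obviously holds for $v = 0$. Pick any $v \in \mathbb{R}^m \setminus \{0\}$ and take $\epsilon \in (0, 1)$. Let $\eta = \epsilon \|v\|_2\, r(\mathcal{P}(A))$ and assume that $v \in \eta \mathcal{P}^\circ(A) = \{z \colon \|A^T z\|_\infty \le \eta\}$, i.e., $v$ is an element of the $\eta$-scaled version of the polar set $\mathcal{P}^\circ(A)$. Note that $\eta > 0$ as $\|v\|_2 > 0$, $\epsilon > 0$, and $r(\mathcal{P}(A)) > 0$ thanks to $A$ having full column rank. It follows from [13, Thm. 1.2] that

\[
\|v\|_2 \le R(\eta \mathcal{P}^\circ(A)) = \eta\, R(\mathcal{P}^\circ(A)) = \frac{\eta}{r(\mathcal{P}(A))}. \tag{88}
\]

Now, owing to $\eta = \epsilon \|v\|_2\, r(\mathcal{P}(A))$, (88) implies that $\epsilon \ge 1$, which contradicts $\epsilon \in (0, 1)$. It therefore follows that $v \in \mathbb{R}^m \setminus \{\eta \mathcal{P}^\circ(A)\}$ for all $\epsilon \in (0, 1)$, which in turn implies that $\|A^T v\|_\infty > \eta = \epsilon \|v\|_2\, r(\mathcal{P}(A))$ for all $\epsilon \in (0, 1)$. In particular, letting $\epsilon \to 1$ yields $\|A^T v\|_\infty \ge \|v\|_2\, r(\mathcal{P}(A))$ as desired. □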
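Lemma 4 and the relation $r(\mathcal{P}) R(\mathcal{P}^\circ) = 1$ can be sanity-checked on the simplest possible instance. The sketch below assumes $A = I_2$ (an illustrative choice, not an example from the paper): the symmetrized convex hull of its columns is the $\ell_1$ unit ball in $\mathbb{R}^2$ with inradius $1/\sqrt{2}$, and its polar set is the $\ell_\infty$ unit ball with circumradius $\sqrt{2}$. The function name `lemma4_gap` is hypothetical.

```python
import math
import random

# A = I_2: P(A) is the l1 unit ball in R^2, whose inradius is 1/sqrt(2)
# (distance from the origin to the edge x + y = 1). Its polar set is the
# l_inf unit ball (a square) with circumradius sqrt(2).
INRADIUS = 1.0 / math.sqrt(2.0)
POLAR_CIRCUMRADIUS = math.sqrt(2.0)


def lemma4_gap(v):
    """Return ||A^T v||_inf - r(P(A)) * ||v||_2 for A = I_2 (nonnegative by (87))."""
    sup_norm = max(abs(v[0]), abs(v[1]))
    two_norm = math.hypot(v[0], v[1])
    return sup_norm - INRADIUS * two_norm


rng = random.Random(0)
gaps = [lemma4_gap([rng.uniform(-1.0, 1.0), rng.uniform(-1.0, 1.0)])
        for _ in range(1000)]
print(min(gaps) >= -1e-12)
print(abs(INRADIUS * POLAR_CIRCUMRADIUS - 1.0) < 1e-12)
```

The bound (87) is tight here for $v = (1, 1)^T$, where both sides equal 1.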


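Proof of (82): With $\sigma_{\min}(A) = \|A^{-1}\|_{2\to 2}^{-1}$ [32, Sec. 5.2.1] for a full rank matrix $A \in \mathbb{R}^{m \times m}$ it follows that

\begin{align*}
\mathrm{P}\big[\bar{\mathcal{E}}_2\big]
&= \mathrm{P}\Big[\min_{\ell}\ \big\|\big(U^{(\ell)T}\Phi^T\Phi U^{(\ell)}\big)^{-1}\big\|_{2\to 2}^{-1} < 1 - \delta\Big] \\
&= \mathrm{P}\Big[\max_{\ell}\ \big\|\big(U^{(\ell)T}\Phi^T\Phi U^{(\ell)}\big)^{-1}\big\|_{2\to 2} > \frac{1}{1 - \delta}\Big] \\
&\le 2e^{-\tau/2},
\end{align*}

where $\tau > 0$ is the numerical constant in Theorem 3 and the last inequality is thanks to (44).

Proof of (83): By the union bound

\begin{align}
\mathrm{P}\big[\bar{\mathcal{E}}_3^{(\ell,i,s)}\big]
&\le \sum_{k \neq \ell,\, j} \mathrm{P}\Bigg[\big|a_j^{(k)T} U^{(k)T}(\Phi^T\Phi - I)U^{(\ell)} \tilde r_s^{(\ell)}\big| > \sqrt{\frac{6 \log N + 2 \log s_{\max}}{d_{\min}}}\ \big\|U^{(k)T}(\Phi^T\Phi - I)U^{(\ell)}\big\|_{2\to 2}\, \big\|\tilde r_s^{(\ell)}\big\|_2\Bigg] \notag \\
&\le \sum_{k \neq \ell,\, j} \mathrm{P}\Bigg[\big|a_j^{(k)T} U^{(k)T}(\Phi^T\Phi - I)U^{(\ell)} \tilde r_s^{(\ell)}\big| > \sqrt{\frac{6 \log N + 2 \log s_{\max}}{d_k}}\ \big\|U^{(k)T}(\Phi^T\Phi - I)U^{(\ell)} \tilde r_s^{(\ell)}\big\|_2\Bigg] \notag \\
&\le \sum_{k \neq \ell,\, j} \frac{2}{N^3 s_{\max}} \le \frac{2}{N^2 s_{\max}}, \tag{89}
\end{align}

where (89) follows from Proposition 1 with $a = a_j^{(k)}$, $b = U^{(k)T}(\Phi^T\Phi - I)U^{(\ell)} \tilde r_s^{(\ell)}$, and $\beta = \sqrt{6 \log N + 2 \log s_{\max}}$.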


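Proof of (84): We first show that $\tilde r_s^{(\ell)}/\|\tilde r_s^{(\ell)}\|_2$ is distributed uniformly at random on $\mathbb{S}^{d_\ell - 1}$; (84) then follows by application of Lemma 3.

Recall that we consider reduced OMP, which computes a sparse representation of $x_i^{(\ell)} = \Phi U^{(\ell)} a_i^{(\ell)}$ using the columns of $X_{-i}^{(\ell)} = \Phi U^{(\ell)} A_{-i}^{(\ell)}$ as dictionary elements, i.e., $\Lambda_s$ and $\tilde r_s^{(\ell)}$ depend only on the random quantities $\Phi U^{(\ell)}$, $a_i^{(\ell)}$, and $A_{-i}^{(\ell)}$. In order to reflect these restricted dependencies, we write $\tilde r_s^{(\ell)} = \tilde r_s^{(\ell)}(\Phi U^{(\ell)}, a_i^{(\ell)}, A_{-i}^{(\ell)})$ and $\Lambda_s = \Lambda_s(\Phi U^{(\ell)}, a_i^{(\ell)}, A_{-i}^{(\ell)})$. Here, the first argument specifies the basis matrix of the data points, the second argument corresponds to the coefficient vector of the data point (in the basis specified by the first argument) a sparse representation is to be computed for, and the third argument designates the coefficient matrix of the dictionary elements (again in the basis specified by the first argument).

We start by showing that the distribution of $\tilde r_s^{(\ell)}$ is rotationally invariant. For a deterministic unitary matrix $W \in \mathbb{R}^{d_\ell \times d_\ell}$, we have

\[
\Lambda_s\big(\Phi U^{(\ell)} W^T, W a_i^{(\ell)}, W A_{-i}^{(\ell)}\big) = \Lambda_s\big(\Phi U^{(\ell)}, a_i^{(\ell)}, A_{-i}^{(\ell)}\big)
\]

as the $x_j^{(\ell)}$ can be written as $x_j^{(\ell)} = \Phi U^{(\ell)} a_j^{(\ell)} = \Phi U^{(\ell)} W^T W a_j^{(\ell)}$. Using the shorthand notation $\Lambda'_s$ for $\Lambda_s(\Phi U^{(\ell)} W^T, W a_i^{(\ell)}, W A_{-i}^{(\ell)})$ and recalling that $\tilde r_s^{(\ell)} = \big(I - A_{\Lambda_s}^{(\ell)} (\Phi U^{(\ell)} A_{\Lambda_s}^{(\ell)})^\dagger \Phi U^{(\ell)}\big) a_i^{(\ell)}$, it follows that

\begin{align*}
\tilde r_s^{(\ell)}\big(\Phi U^{(\ell)} W^T, W a_i^{(\ell)}, W A_{-i}^{(\ell)}\big)
&= \Big(I - W A_{\Lambda'_s}^{(\ell)} \big(\Phi U^{(\ell)} W^T W A_{\Lambda'_s}^{(\ell)}\big)^\dagger \Phi U^{(\ell)} W^T\Big) W a_i^{(\ell)} \\
&= \Big(I - W A_{\Lambda_s}^{(\ell)} \big(\Phi U^{(\ell)} W^T W A_{\Lambda_s}^{(\ell)}\big)^\dagger \Phi U^{(\ell)} W^T\Big) W a_i^{(\ell)} \\
&= W \Big(I - A_{\Lambda_s}^{(\ell)} \big(\Phi U^{(\ell)} A_{\Lambda_s}^{(\ell)}\big)^\dagger \Phi U^{(\ell)}\Big) a_i^{(\ell)} \\
&= W\, \tilde r_s^{(\ell)}\big(\Phi U^{(\ell)}, a_i^{(\ell)}, A_{-i}^{(\ell)}\big).
\end{align*}

By rotational invariance of the distributions of $a_i^{(\ell)}$, $A_{-i}^{(\ell)}$, and $\Phi$ (by assumption in Theorem 3), we have $W a_i^{(\ell)} \sim a_i^{(\ell)}$, $W A_{-i}^{(\ell)} \sim A_{-i}^{(\ell)}$, and $\Phi U^{(\ell)} W^T \sim \Phi U^{(\ell)}$ (because $\mathrm{span}(U^{(\ell)} W^T) = \mathrm{span}(U^{(\ell)})$ and the columns of $U^{(\ell)} W^T$ are orthonormal). We therefore get

\begin{align}
\tilde r_s^{(\ell)}\big(\Phi U^{(\ell)}, a_i^{(\ell)}, A_{-i}^{(\ell)}\big)
&\sim \tilde r_s^{(\ell)}\big(\Phi U^{(\ell)} W^T, W a_i^{(\ell)}, W A_{-i}^{(\ell)}\big) \notag \\
&= W\, \tilde r_s^{(\ell)}\big(\Phi U^{(\ell)}, a_i^{(\ell)}, A_{-i}^{(\ell)}\big). \tag{90}
\end{align}

Since (90) holds for all unitary matrices $W$, the distribution of $\tilde r_s^{(\ell)}$ is rotationally invariant and $\tilde r_s^{(\ell)}/\|\tilde r_s^{(\ell)}\|_2$ is, indeed, distributed uniformly on $\mathbb{S}^{d_\ell - 1}$. We finally exploit this property of $\tilde r_s^{(\ell)}$ to upper-bound $\mathrm{P}\big[\bar{\mathcal{E}}_4^{(\ell,i,s)}\big]$ as follows. A union bound over all $k$, $k \neq \ell$, yields

\begin{align}
\mathrm{P}\big[\bar{\mathcal{E}}_4^{(\ell,i,s)}\big]
&\le \sum_{k \neq \ell} \mathrm{P}\Bigg[\bigg\|A^{(k)T} U^{(k)T} U^{(\ell)} \frac{\tilde r_s^{(\ell)}}{\|\tilde r_s^{(\ell)}\|_2}\bigg\|_\infty > 4(3 \log N + \log s_{\max})\, \frac{\|U^{(k)T} U^{(\ell)}\|_F}{\sqrt{d_k}\sqrt{d_\ell}}\Bigg] \notag \\
&\le \sum_{k \neq \ell} \frac{n_k + 1}{N^3 s_{\max}} = \frac{N - n_\ell + L - 1}{s_{\max} N^3} \le \frac{N + L}{s_{\max} N^3} \le \frac{2}{s_{\max} N^2}, \tag{91}
\end{align}

where (91) follows by application of Lemma 3 with $\Lambda = A^{(k)}$, $a = \tilde r_s^{(\ell)}/\|\tilde r_s^{(\ell)}\|_2$, $B = U^{(k)T} U^{(\ell)}$, and $c = 4(3 \log N + \log s_{\max})$.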


References 

[1] N. Ailon and E. Liberty. An almost optimal unrestricted fast Johnson-Lindenstrauss transform. 
ACM Trans. Algorithms , 9(3): 1-12, 2013. 

[2] David Alonso-Gutierrez. On the isotropy constant of random convex sets. Proc. Amer. Math. 
Soc., 136(9):3293-3300, 2008. 

[3] R. Basri and D.W. Jacobs. Lambertian reflectance and linear subspaces. IEEE Trans. Pattern 
Anal. Mach. Intell., 25(2):218-233, 2003.

[4] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004. 

[5] Broad-Institute. Cancer program data sets, 2013. 

http://www.broadinstitute.org/cgi-bin/cancer/datasets.cgi. 

[6] E. J. Candes, J. Romberg, and T. Tao. Robust uncertainty principles: Exact signal reconstruction
from highly incomplete frequency information. IEEE Trans. Inf. Theory, 52(2):489-509,
2006.

[7] E. L. Dyer, A. C. Sankaranarayanan, and R. G. Baraniuk. Greedy feature selection for subspace 
clustering. Journal of Mach. Learn. Research, 14:2487-2517, 2013. 

[8] E. Elhamifar and R. Vidal. Sparse subspace clustering. In Proc. of IEEE Conf. Comput. 
Vision Pattern Recogn., pages 2790-2797, 2009. 

[9] E. Elhamifar and R. Vidal. Sparse subspace clustering: Algorithm, theory, and applications. 
IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2765-2781, 2013.



[10] S. Foucart and H. Rauhut. A Mathematical Introduction to Compressive Sensing. Springer, 
Berlin, Heidelberg, 2013. 

[11] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination 
cone models for face recognition under variable lighting and pose. IEEE Trans. Pattern Anal. 
Mach. Intelligence, 23(6):643-660, 2001. 

[12] G. H. Golub and C. F. Van Loan. Matrix Computations. JHU Press, 1996. 

[13] P. Gritzmann and V. Klee. Inner and outer j-radii of convex bodies in finite-dimensional 
normed spaces. Discrete Comput. Geom., 7(1):255-280, 1992.

[14] T. Hastie and P. Y. Simard. Metrics and models for handwritten character recognition. Stat. 
Sci., 13(1):54-65, 1998.

[15] R. Heckel, E. Agustsson, and H. Bolcskei. Neighborhood selection for thresholding based 
subspace clustering. In Proc. of IEEE Int. Conf. Acoust. Speech Sig. Proc. (ICASSP), pages 
6761-6765. IEEE, 2014. 

[16] R. Heckel and H. Bolcskei. Robust subspace clustering via thresholding. IEEE Trans. Inform. 
Theory, 61(11):6320-6342, 2015.

[17] R. Heckel, M. Tschannen, and H. Bolcskei. Subspace clustering of dimensionality-reduced 
data. In Proc. of IEEE Int. Symp. on Inf. Theory, pages 2997-3001. IEEE, July 2014. 

[18] R. A. Horn and C. R. Johnson. Matrix analysis. Cambridge University Press, 2012. 

[19] Jiaji Huang, Qiang Qiu, and Robert Calderbank. The role of principal angles in subspace 
classification. Preprint, arXiv:1507.04230, 2015.

[20] D. Jiang, C. Tang, and A. Zhang. Cluster analysis for gene expression data: A survey. IEEE 
Trans. Knowl. Data Eng., 16(11):1370-1386, 2004. 

[21] F. Krahmer and R. Ward. New and improved Johnson-Lindenstrauss embeddings via the 
restricted isometry property. SIAM J. Math. Anal., 43(3):1269-1281, 2011. 

[22] A. Lapidoth. A foundation in digital communication. Cambridge University Press, 2009. 

[23] Y. LeCun and C. Cortes. The MNIST database, 2013. http://yann.lecun.com/exdb/mnist/. 

[24] K. C. Lee, J. Ho, and D. J. Kriegman. Acquiring linear subspaces for face recognition under 
variable lighting. IEEE Trans. Pattern Anal. Mach. Intell., 27(5):684-698, 2005.

[25] K. Liu, H. Kargupta, and J. Ryan. Random projection-based multiplicative data perturbation 
for privacy preserving distributed data mining. IEEE Trans. Knowl. Data Eng., 18(1):92-106, 
2006. 

[26] A. Y. Ng, M. I. Jordan, and Y. Weiss. On spectral clustering: Analysis and an algorithm. In
Advances in Neural Information Processing Systems, pages 849-856, 2001. 

[27] M. Nokleby, M. Rodrigues, and R. Calderbank. Discrimination on the Grassmann manifold: 
Fundamental limits of subspace classifiers. IEEE Trans. Inform. Theory, 61(4):2133-2147, 
2015. 




[28] M. Soltanolkotabi and E. J. Candes. A geometric analysis of subspace clustering with outliers. 
Ann. Stat., 40(4):2195-2238, 2012. 

[29] M. Soltanolkotabi, E. Elhamifar, and E. J. Candes. Robust subspace clustering. Ann. Stat., 
42(2):669-699, 2014. 

[30] D. Spielman. Spectral graph theory. Lecture notes, 2012. 

[31] M. Tschannen. Dimensionality reduction for sparse subspace clustering. MS thesis, ETH 
Zurich, March 2014. 

[32] S. S. Vempala. The Random Projection Method. American Mathematical Society, 2005. 

[33] R. Vershynin. Introduction to the non-asymptotic analysis of random matrices. In Y. C. 
Eldar and G. Kutyniok, editors, Compressed sensing: Theory and applications, pages 210-268. 
Cambridge University Press, 2012. 

[34] R. Vidal. Subspace clustering. IEEE Signal Process. Mag., 28(2):52-68, 2011. 

[35] U. von Luxburg. A tutorial on spectral clustering. Stat. Comput., 17(4):395-416, 2007. 

[36] Y. Wang, Y. Wang, and A. Singh. A deterministic analysis of noisy sparse subspace clustering 
for dimensionality-reduced data. In Proc. of Int. Conf. on Machine Learning (ICML), pages 
1422-1431, 2015. 

[37] C. You and R. Vidal. Sparse subspace clustering by orthogonal matching pursuit. July 2015. 

[38] T. Zhang, A. Szlam, Yi Wang, and G. Lerman. Hybrid linear modeling via local best-fit flats. 
Int. J. Comput. Vision, 100:217-240, 2012. 





